In machine learning, experimentation is the systematic process of designing, executing, and analyzing different configurations to identify the settings that perform best on a given task. Experimentation is learning by doing: it involves systematically changing parameters, evaluating the results with metrics, and comparing different approaches to find the best solution. In essence, it’s the practice of testing and refining machine learning models through controlled experiments to improve their performance.
The key is to modify only one or a few variables at a time, so that the impact of each change on model performance can be isolated and understood. In this assignment I conduct six experiments, keeping the evaluation consistent with the small helper sketched below.
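To keep the experiments comparable, I use a small helper (my own sketch, not part of any course template) that pulls the same headline metrics out of caret’s confusionMatrix() for every model:
# Hedged helper: compute the metrics reported throughout this assignment from
# predicted and true labels (both coerced to factors with levels "no"/"yes")
evaluate_model <- function(predictions, truth) {
  cm <- caret::confusionMatrix(factor(predictions, levels = c("no", "yes")),
                               factor(truth, levels = c("no", "yes")))
  c(Accuracy = unname(cm$overall["Accuracy"]),
    Kappa = unname(cm$overall["Kappa"]),
    BalancedAccuracy = unname(cm$byClass["Balanced Accuracy"]))
}
For example, evaluate_model(predictions, test_data$y) reproduces the Accuracy and Kappa figures quoted after each confusion matrix below.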
# Load the packages used throughout the experiments
library(dplyr)        # piping and mutate()/across()
library(caret)        # createDataPartition(), confusionMatrix()
library(rpart)        # decision trees and pruning
library(ROSE)         # ovun.sample() for under-sampling
library(randomForest) # random forests
library(ada)          # AdaBoost
# Load the data
bank <- read.csv("bank-full.csv", sep = ';')
# Summarize each column of the dataset
summary(bank)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
# Compact structure view of each column
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
# Recode "unknown" entries as proper missing values (NA) in every column
bank <- bank %>%
  mutate(across(.cols = everything(),
                .fns = ~replace(., . == "unknown", NA)))
# Count missing values per column
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 13020 0 0 0 0 0 0 36959
## y
## 0
# Count duplicate rows
sum(duplicated(bank))
## [1] 1
There is only one duplicate row, and it is removed:
# Remove the duplicate row
bank <- bank[!duplicated(bank), ]
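The model code below (randomForest(), ada()) cannot handle missing values, and the later output — 34,554 training rows and importance tables without contact, day, month, or poutcome — implies a cleaning step that is not shown. A hedged reconstruction of what was likely done (the exact columns dropped are an assumption inferred from that output):
# Assumed cleaning step (inferred from later output, not shown in the original):
# drop the mostly-missing columns (contact, poutcome) plus day and month, which
# never appear in the importance tables, then drop rows still containing NAs
bank <- bank %>%
  select(-contact, -poutcome, -day, -month) %>%
  na.omit()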
Experiment 1: evaluating the impact of under-sampling to balance the dataset on the performance of a Decision Tree model.
# Data sampling: under-sample the majority class down to a 20,000-row total
set.seed(123)
data_balanced <- ovun.sample(y ~ ., data = bank, method = "under", N = 20000)$data
# Splitting data into training and testing sets
train_index <- createDataPartition(data_balanced$y, p = 0.8, list = FALSE)
train_data <- data_balanced[train_index, ]
test_data <- data_balanced[-train_index, ]
# Ensure the target variable 'y' is a factor
train_data$y <- factor(train_data$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
# Training the Decision Tree model
tree_model <- rpart(y ~ ., data = train_data, method = "class")
# Making predictions
predictions <- predict(tree_model, test_data, type = "class")
predictions <- factor(predictions, levels = c("no", "yes"))
# Evaluating the model
results <- confusionMatrix(predictions, test_data$y)
print(results)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 2649 352
## yes 346 652
##
## Accuracy : 0.8255
## 95% CI : (0.8133, 0.8371)
## No Information Rate : 0.7489
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5349
##
## Mcnemar's Test P-Value : 0.8499
##
## Sensitivity : 0.8845
## Specificity : 0.6494
## Pos Pred Value : 0.8827
## Neg Pred Value : 0.6533
## Prevalence : 0.7489
## Detection Rate : 0.6624
## Detection Prevalence : 0.7504
## Balanced Accuracy : 0.7669
##
## 'Positive' Class : no
##
Accuracy is 0.8255, significantly above the No Information Rate of 0.7489, so the model predicts better than always guessing the majority class. Under-sampling reduced the class imbalance and produced a model that performs reasonably well on both classes (balanced accuracy 0.7669), though it remains strongest on the majority class; specificity (0.6494) is the obvious target for improvement.
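Because method = "under" only thins the majority class while keeping every "yes" case, N = 20000 yields reduced imbalance rather than a 50/50 split; a quick sanity check (sketch):
# Class mix after under-sampling; roughly 75% "no" / 25% "yes", consistent
# with the 0.7489 prevalence reported in the confusion matrix above
prop.table(table(data_balanced$y))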
Experiment 2: determining the effect of pruning on a Decision Tree’s ability to generalize by reducing overfitting.
# Preparing the data
set.seed(123) # for reproducibility
train_index <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_data <- bank[train_index, ]
test_data <- bank[-train_index, ]
# Ensure the target variable 'y' is a factor with levels
train_data$y <- factor(train_data$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
# Train a more complex tree to find the optimal cp value
complex_tree <- rpart(y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0.01))
printcp(complex_tree) # Display the CP table for choosing the best cp
##
## Classification tree:
## rpart(formula = y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0.01))
##
## Variables actually used in tree construction:
## [1] duration
##
## Root node error: 4017/34554 = 0.11625
##
## n= 34554
##
## CP nsplit rel error xerror xstd
## 1 0.027757 0 1.00000 1.00000 0.014832
## 2 0.010000 2 0.94449 0.94922 0.014499
# Prune the tree at a cp value chosen from the CP table above
pruned_tree <- prune(complex_tree, cp = 0.015)
# Make predictions with the pruned tree
pruned_predictions <- predict(pruned_tree, test_data, type = "class")
pruned_predictions <- factor(pruned_predictions, levels = c("no", "yes"))
# Evaluate the pruned model
confusionMatrix(pruned_predictions, test_data$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7475 789
## yes 159 215
##
## Accuracy : 0.8903
## 95% CI : (0.8835, 0.8968)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 0.03046
##
## Kappa : 0.2657
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9792
## Specificity : 0.2141
## Pos Pred Value : 0.9045
## Neg Pred Value : 0.5749
## Prevalence : 0.8838
## Detection Rate : 0.8654
## Detection Prevalence : 0.9567
## Balanced Accuracy : 0.5967
##
## 'Positive' Class : no
##
Accuracy came in at 0.8903, just above the No Information Rate of 0.8838 (p = 0.03). This figure is not directly comparable to Experiment 1, which was evaluated on the under-sampled data rather than the full, imbalanced test set. Note also from the CP table that both cp = 0.01 and cp = 0.015 yield the same two-split tree built on duration alone, so the result mainly confirms that a very compact tree generalizes adequately. The low specificity (0.2141) and balanced accuracy (0.5967) show the pruned tree rarely identifies the minority class, so future efforts should focus on strategies to improve minority-class detection.
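Rather than choosing cp by eye, the standard rpart idiom selects the value with the lowest cross-validated error from the CP table; a minimal sketch:
# Pick the cp with the smallest cross-validated error (xerror) and re-prune
best_cp <- complex_tree$cptable[which.min(complex_tree$cptable[, "xerror"]), "CP"]
auto_pruned_tree <- prune(complex_tree, cp = best_cp)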
Experiment 3: assessing the baseline performance of a Random Forest model without any tuning.
# Preparing data
set.seed(123)
train_index <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_data <- bank[train_index, ]
test_data <- bank[-train_index, ]
# Ensure the target variable 'y' is a factor
train_data$y <- factor(train_data$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
# Training the Random Forest model
rf_model <- randomForest(y ~ ., data = train_data, ntree = 100)
# Making predictions
rf_predictions <- predict(rf_model, test_data)
# Evaluating the model
rf_results <- confusionMatrix(rf_predictions, test_data$y)
rf_importance <- importance(rf_model) # Obtaining variable importance
print(rf_results)
print(rf_importance)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7430 667
## yes 204 337
##
## Accuracy : 0.8992
## 95% CI : (0.8926, 0.9054)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 2.894e-06
##
## Kappa : 0.3863
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9733
## Specificity : 0.3357
## Pos Pred Value : 0.9176
## Neg Pred Value : 0.6229
## Prevalence : 0.8838
## Detection Rate : 0.8602
## Detection Prevalence : 0.9374
## Balanced Accuracy : 0.6545
##
## 'Positive' Class : no
##
## MeanDecreaseGini
## age 816.94494
## job 347.82100
## marital 153.64204
## education 156.03972
## default 13.44796
## balance 894.71597
## housing 234.38753
## loan 81.89238
## duration 2125.60711
## campaign 284.48720
## pdays 538.83737
## previous 255.87530
The Random Forest model demonstrated robust performance, with high accuracy (0.8992), and its importance scores give insight into the drivers of prediction: duration was by far the most influential feature, followed by balance and age.
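The same ranking can be inspected visually with randomForest’s built-in plot; a one-line sketch:
# Dot plot of MeanDecreaseGini, sorted from most to least important
varImpPlot(rf_model, main = "Random Forest Variable Importance")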
Experiment 4: exploring the impact of increasing the number of trees and adjusting the number of variables considered at each split on Random Forest performance.
# Setting hyperparameters
ntree_value <- 500 # Increase the number of trees from the baseline 100
mtry_value <- round(sqrt(ncol(train_data))) # Variables tried at each split
# (note: ncol() also counts the target column, so this can differ slightly
# from randomForest's classification default of floor(sqrt(#predictors)))
# Training the Random Forest model with tuned parameters
tuned_rf_model <- randomForest(y ~ ., data = train_data, ntree = ntree_value, mtry = mtry_value)
# Making predictions
tuned_rf_predictions <- predict(tuned_rf_model, test_data)
# Evaluating the model
tuned_rf_results <- confusionMatrix(tuned_rf_predictions, test_data$y)
tuned_rf_importance <- importance(tuned_rf_model) # Obtaining variable importance
print(tuned_rf_results)
print(tuned_rf_importance)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7395 645
## yes 239 359
##
## Accuracy : 0.8977
## 95% CI : (0.8911, 0.904)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 2.249e-05
##
## Kappa : 0.3958
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9687
## Specificity : 0.3576
## Pos Pred Value : 0.9198
## Neg Pred Value : 0.6003
## Prevalence : 0.8838
## Detection Rate : 0.8561
## Detection Prevalence : 0.9308
## Balanced Accuracy : 0.6631
##
## 'Positive' Class : no
##
## MeanDecreaseGini
## age 949.72548
## job 413.39190
## marital 179.07098
## education 182.70440
## default 14.65688
## balance 1112.55389
## housing 245.36679
## loan 95.48039
## duration 2356.83377
## campaign 325.41178
## pdays 581.01435
## previous 239.66907
Accuracy slipped slightly to 0.8977, while Kappa (0.3863 to 0.3958), specificity (0.3357 to 0.3576), and balanced accuracy (0.6545 to 0.6631) improved marginally, so the extra trees added computation without a clear net benefit. Tuning provided only minimal changes, indicating that the baseline settings were already near optimal for this dataset; further exploration of other hyperparameters might be beneficial, as sketched below.
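One hedged way to explore mtry further is randomForest’s built-in tuneRF(), which steps mtry up and down from a starting value and tracks out-of-bag error; a sketch (the column selection assumes y is the only non-predictor):
# Search over mtry using OOB error; stepFactor scales mtry at each step and
# improve is the minimum relative OOB gain required to keep searching
set.seed(123)
mtry_search <- tuneRF(x = train_data[, setdiff(names(train_data), "y")],
                      y = train_data$y,
                      ntreeTry = 200, stepFactor = 1.5, improve = 0.01)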
Experiment 5: assessing the baseline performance of an AdaBoost model with default settings:
# Preparing data
set.seed(123)
train_index <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_data <- bank[train_index, ]
test_data <- bank[-train_index, ]
# Ensure the target variable 'y' is a factor
train_data$y <- factor(train_data$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))
# Training the AdaBoost model
ada_model <- ada(y ~ ., data = train_data)
# Making predictions ("vector" is the ada package's documented type for class labels)
ada_predictions <- predict(ada_model, test_data, type = "vector")
# Evaluating the model
ada_results <- confusionMatrix(ada_predictions, test_data$y)
print(ada_results)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7423 682
## yes 211 322
##
## Accuracy : 0.8966
## 95% CI : (0.89, 0.903)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 8.251e-05
##
## Kappa : 0.3681
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9724
## Specificity : 0.3207
## Pos Pred Value : 0.9159
## Neg Pred Value : 0.6041
## Prevalence : 0.8838
## Detection Rate : 0.8593
## Detection Prevalence : 0.9383
## Balanced Accuracy : 0.6465
##
## 'Positive' Class : no
##
The accuracy (0.8966) is similar to Random Forest’s, indicating strong overall performance. The Kappa of 0.3681 and high sensitivity (0.9724) combined with low specificity (0.3207) show that the model is excellent at identifying the "no" class (the designated positive class) but produces many false positives, missing most "yes" cases.
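Since the model heavily favors the majority "no" class, one remedy worth trying is to score class probabilities and lower the threshold for predicting "yes". A sketch, assuming the ada package’s type = "probs" option, which returns a probability matrix whose columns follow the factor levels (so column 2 is "yes"):
# Score P(yes) and predict "yes" above a lowered 0.3 threshold
# (class prediction otherwise uses an implicit 0.5 cut-off)
ada_probs <- predict(ada_model, test_data, type = "probs")[, 2]  # assumed: col 2 = "yes"
threshold_preds <- factor(ifelse(ada_probs > 0.3, "yes", "no"), levels = c("no", "yes"))
confusionMatrix(threshold_preds, test_data$y)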
Experiment 6: determining the effect of the number of boosting iterations on AdaBoost’s accuracy and specificity.
# Setting hyperparameters
iter_values <- 50 # Number of boosting iterations (50 is also ada()'s default,
                  # so this run effectively repeats the baseline configuration)
# Training the AdaBoost model with the specified iteration count
tuned_ada_model <- ada(y ~ ., data = train_data, iter = iter_values)
# Making predictions
tuned_ada_predictions <- predict(tuned_ada_model, test_data, type = "vector")
# Evaluating the model
tuned_ada_results <- confusionMatrix(as.factor(tuned_ada_predictions), as.factor(test_data$y))
print(tuned_ada_results)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7426 686
## yes 208 318
##
## Accuracy : 0.8965
## 95% CI : (0.8899, 0.9029)
## No Information Rate : 0.8838
## P-Value [Acc > NIR] : 9.476e-05
##
## Kappa : 0.3649
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9728
## Specificity : 0.3167
## Pos Pred Value : 0.9154
## Neg Pred Value : 0.6046
## Prevalence : 0.8838
## Detection Rate : 0.8597
## Detection Prevalence : 0.9391
## Balanced Accuracy : 0.6447
##
## 'Positive' Class : no
##
The accuracy remained essentially unchanged at 0.8965, and Kappa dipped slightly to 0.3649. This is expected: because ada() defaults to 50 iterations, this run repeated the baseline configuration apart from randomness in the boosting samples, so no gain from "additional" iterations was actually tested.
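To genuinely vary the boosting rounds, the iteration count has to move away from the default; a minimal sketch looping over a few values and reporting test accuracy:
# Compare test accuracy across different numbers of boosting iterations
for (n_iter in c(25, 100, 200)) {
  m <- ada(y ~ ., data = train_data, iter = n_iter)
  p <- predict(m, test_data, type = "vector")
  cat("iter =", n_iter, "-> accuracy =", round(mean(p == test_data$y), 4), "\n")
}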
I’ll create a chart visualizing the accuracy from each experiment for better comparative analysis.
# Collect the accuracy results from the six experiments
results <- data.frame(
Algorithm = c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "AdaBoost", "AdaBoost"),
Experiment = c("Baseline", "Tuned", "Baseline", "Tuned", "Baseline", "Tuned"),
Accuracy = c(0.8255, 0.8903, 0.8992, 0.8977, 0.8966, 0.8965)
)
# Plotting the results
library(ggplot2)
ggplot(results, aes(x = Experiment, y = Accuracy, fill = Algorithm)) +
geom_bar(stat = "identity", position = position_dodge()) +
ggtitle("Accuracy Across Different Experiments") +
xlab("Experiment Type") +
ylab("Accuracy") +
theme_minimal()
The Decision Tree shows a significant increase in accuracy when tuned (pruned), though its baseline, evaluated on the under-sampled data, is the lowest of the six runs. Random Forest attains the highest accuracy in both the baseline and tuned experiments, with only a slight decrease in the tuned setup; this suggests that its ensemble nature makes it robust to overfitting and that it performs well even without extensive tuning. AdaBoost is the most stable of the three, with accuracy essentially unchanged between runs (0.8966 vs. 0.8965), implying that the baseline parameters were already near optimal and further tuning yielded no significant benefit. Overall, Random Forest and AdaBoost show high robustness, evidenced by their consistent performance across different settings; they are less sensitive to overfitting than a single decision tree, making them reliable for various applications.