Introduction

In machine learning, experimentation refers to the systematic process of designing, executing, and analyzing different configurations to identify the settings that perform best on a given task. Experimentation is learning by doing: it involves systematically changing parameters, evaluating the results with metrics, and comparing different approaches to find the best solution. Essentially, it is the practice of testing and refining machine learning models through controlled experiments to improve their performance.

The key is to modify only one or a few variables at a time, to isolate the impact of each change and understand its effect on model performance. In this assignment I will conduct at least six experiments.
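As an illustration of that pattern, here is a minimal sketch (not one of the experiments below) that varies a single hyperparameter while holding everything else fixed; it assumes the packages and the train_data/test_data split created later in this report:

# Vary one hyperparameter (rpart's cp), hold everything else fixed,
# and record the same metric for each setting
cp_grid <- c(0.001, 0.01, 0.05)
accuracies <- sapply(cp_grid, function(cp) {
  model <- rpart(y ~ ., data = train_data, method = "class",
                 control = rpart.control(cp = cp))
  preds <- predict(model, test_data, type = "class")
  mean(preds == test_data$y)  # accuracy on the held-out set
})
data.frame(cp = cp_grid, accuracy = accuracies)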

Load Data

# Load the packages used throughout this report
library(tidyverse)     # dplyr, tidyr and ggplot2 for wrangling and plotting
library(caret)         # createDataPartition() and confusionMatrix()
library(rpart)         # decision trees
library(randomForest)  # random forests
library(ROSE)          # ovun.sample() for sampling-based class balancing
library(ada)           # AdaBoost

# Load the data
bank <- read.csv("bank-full.csv", sep = ';')

# View the first few rows of the dataset
head(bank)
# Summary of the data
summary(bank)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Check data types
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

Data Preparation

# Recode the literal string "unknown" as a proper missing value (NA)
bank <- bank %>% 
  mutate(across(.cols = everything(),
                .fns = ~replace(., . == "unknown", NA)))

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0       288         0      1857         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##     13020         0         0         0         0         0         0     36959 
##         y 
##         0
# Drop the largely missing contact and poutcome columns, along with day and
# month, then remove the rows that still contain NAs
bank <- bank %>% 
  select(-c(contact, poutcome, day, month)) %>% 
  drop_na()
sum(duplicated(bank))
## [1] 1

There is only one duplicated row, and it will be removed.

bank <- bank[!duplicated(bank), ]
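One further preparation step may be needed depending on the R version: since R 4.0, read.csv() keeps strings as character vectors, and randomForest() (used below) expects categorical predictors to be factors. A minimal sketch of the conversion:

# Convert the character columns (including the target y) to factors,
# since randomForest() does not accept character predictors
bank <- bank %>% 
  mutate(across(where(is.character), as.factor))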

Decision Tree

Experiment 1

This experiment evaluates the impact of balancing the dataset via under-sampling on the performance of a Decision Tree model.

# Data sampling: under-sampling keeps all minority ("yes") cases and samples
# the majority class down until the data set has N = 20000 rows in total
set.seed(123)
data_balanced <- ovun.sample(y ~ ., data = bank, method = "under", N = 20000)$data

# Splitting data into training and testing sets
train_index <- createDataPartition(data_balanced$y, p = 0.8, list = FALSE)
train_data <- data_balanced[train_index, ]
test_data <- data_balanced[-train_index, ]

# Ensure the target variable 'y' is a factor
train_data$y <- factor(train_data$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))

# Training the Decision Tree model
tree_model <- rpart(y ~ ., data = train_data, method = "class")

# Making predictions
predictions <- predict(tree_model, test_data, type = "class")
predictions <- factor(predictions, levels = c("no", "yes"))

# Evaluating the model
results <- confusionMatrix(predictions, test_data$y)
print(results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  2649  352
##        yes  346  652
##                                           
##                Accuracy : 0.8255          
##                  95% CI : (0.8133, 0.8371)
##     No Information Rate : 0.7489          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5349          
##                                           
##  Mcnemar's Test P-Value : 0.8499          
##                                           
##             Sensitivity : 0.8845          
##             Specificity : 0.6494          
##          Pos Pred Value : 0.8827          
##          Neg Pred Value : 0.6533          
##              Prevalence : 0.7489          
##          Detection Rate : 0.6624          
##    Detection Prevalence : 0.7504          
##       Balanced Accuracy : 0.7669          
##                                           
##        'Positive' Class : no              
## 

Accuracy is 0.8255, significantly above the No Information Rate of 0.7489, which shows that the model's predictions are better than always guessing the majority class. Under-sampling balanced the class distribution and produced a model that performs reasonably well, especially on the majority class. However, the specificity of 0.6494 leaves room to improve the identification of "yes" cases.
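A possible follow-up (a sketch only, not run here) would be ROSE's combined over- and under-sampling, which balances the target while discarding fewer majority-class rows:

# Combined over/under-sampling to a roughly 50/50 class mix
set.seed(123)
data_both <- ovun.sample(y ~ ., data = bank, method = "both",
                         p = 0.5,          # target proportion of the minority class
                         N = nrow(bank))$data
table(data_both$y)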

Experiment 2

I’ll determine the effect of pruning on a Decision Tree’s ability to generalize by reducing overfitting.

# Preparing the data
set.seed(123)  # for reproducibility
train_index <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_data <- bank[train_index, ]
test_data <- bank[-train_index, ]

# Ensure the target variable 'y' is a factor with levels
train_data$y <- factor(train_data$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))

# Fit a tree at rpart's default complexity (cp = 0.01) and inspect its CP
# table; a smaller cp (e.g. 0.001) would grow a deeper tree to prune back
complex_tree <- rpart(y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0.01))
printcp(complex_tree)  # Displays the CP table for choosing the best CP
## 
## Classification tree:
## rpart(formula = y ~ ., data = train_data, method = "class", control = rpart.control(cp = 0.01))
## 
## Variables actually used in tree construction:
## [1] duration
## 
## Root node error: 4017/34554 = 0.11625
## 
## n= 34554 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.027757      0   1.00000 1.00000 0.014832
## 2 0.010000      2   0.94449 0.94922 0.014499
# Prune the tree at a cp value chosen from the CP table above
pruned_tree <- prune(complex_tree, cp = 0.015)

# Make predictions with the pruned tree
pruned_predictions <- predict(pruned_tree, test_data, type = "class")
pruned_predictions <- factor(pruned_predictions, levels = c("no", "yes"))

# Evaluate the pruned model
confusionMatrix(pruned_predictions, test_data$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7475  789
##        yes  159  215
##                                           
##                Accuracy : 0.8903          
##                  95% CI : (0.8835, 0.8968)
##     No Information Rate : 0.8838          
##     P-Value [Acc > NIR] : 0.03046         
##                                           
##                   Kappa : 0.2657          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9792          
##             Specificity : 0.2141          
##          Pos Pred Value : 0.9045          
##          Neg Pred Value : 0.5749          
##              Prevalence : 0.8838          
##          Detection Rate : 0.8654          
##    Detection Prevalence : 0.9567          
##       Balanced Accuracy : 0.5967          
##                                           
##        'Positive' Class : no              
## 

Accuracy rose to 0.8903, but this figure is not directly comparable to Experiment 1, which was evaluated on a balanced, under-sampled test set; here the No Information Rate is already 0.8838, so the pruned tree is only marginally better than always predicting "no". The low specificity (0.2141) and balanced accuracy (0.5967) show that it identifies very few "yes" cases, so future efforts should focus on strategies that improve the identification of the minority class.
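Rather than hand-picking cp = 0.015 from the printed table, the cp value that minimizes the cross-validated error can be selected programmatically; a short sketch:

# Pick the CP with the lowest cross-validated error (xerror) from the CP table
best_cp <- complex_tree$cptable[which.min(complex_tree$cptable[, "xerror"]), "CP"]
auto_pruned_tree <- prune(complex_tree, cp = best_cp)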

Random Forest

Experiment 1

I’ll assess the baseline performance of a Random Forest model without any tuning.

# Preparing data
set.seed(123)
train_index <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_data <- bank[train_index, ]
test_data <- bank[-train_index, ]

# Ensure the target variable 'y' is a factor
train_data$y <- factor(train_data$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))

# Training the Random Forest model
rf_model <- randomForest(y ~ ., data = train_data, ntree = 100)

# Making predictions
rf_predictions <- predict(rf_model, test_data)

# Evaluating the model
rf_results <- confusionMatrix(rf_predictions, test_data$y)
rf_importance <- importance(rf_model)  # Obtaining variable importance

print(rf_results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7430  667
##        yes  204  337
##                                           
##                Accuracy : 0.8992          
##                  95% CI : (0.8926, 0.9054)
##     No Information Rate : 0.8838          
##     P-Value [Acc > NIR] : 2.894e-06       
##                                           
##                   Kappa : 0.3863          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9733          
##             Specificity : 0.3357          
##          Pos Pred Value : 0.9176          
##          Neg Pred Value : 0.6229          
##              Prevalence : 0.8838          
##          Detection Rate : 0.8602          
##    Detection Prevalence : 0.9374          
##       Balanced Accuracy : 0.6545          
##                                           
##        'Positive' Class : no              
## 
print(rf_importance)
##           MeanDecreaseGini
## age              816.94494
## job              347.82100
## marital          153.64204
## education        156.03972
## default           13.44796
## balance          894.71597
## housing          234.38753
## loan              81.89238
## duration        2125.60711
## campaign         284.48720
## pdays            538.83737
## previous         255.87530

The Random Forest model demonstrated robust performance, with high accuracy (0.8992), and it also provides insight into feature importance. Duration was identified as the most influential feature, followed by balance and age, highlighting the key drivers of the predictions.
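The importance scores are easier to compare visually; the randomForest package provides a plotting helper for this:

# Dot plot of MeanDecreaseGini for each predictor
varImpPlot(rf_model, main = "Variable Importance (Baseline Random Forest)")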

Experiment 2

Exploring the impact of increasing the number of trees and adjusting the number of variables considered at each split on model performance.

# Setting hyperparameters
ntree_value <- 500  # More trees than the baseline's 100
mtry_value <- round(sqrt(ncol(train_data)))  # ~sqrt(p); note ncol() also counts the target column

# Training the Random Forest model with tuned parameters
tuned_rf_model <- randomForest(y ~ ., data = train_data, ntree = ntree_value, mtry = mtry_value)

# Making predictions
tuned_rf_predictions <- predict(tuned_rf_model, test_data)

# Evaluating the model
tuned_rf_results <- confusionMatrix(tuned_rf_predictions, test_data$y)
tuned_rf_importance <- importance(tuned_rf_model)  # Obtaining variable importance

print(tuned_rf_results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7395  645
##        yes  239  359
##                                          
##                Accuracy : 0.8977         
##                  95% CI : (0.8911, 0.904)
##     No Information Rate : 0.8838         
##     P-Value [Acc > NIR] : 2.249e-05      
##                                          
##                   Kappa : 0.3958         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9687         
##             Specificity : 0.3576         
##          Pos Pred Value : 0.9198         
##          Neg Pred Value : 0.6003         
##              Prevalence : 0.8838         
##          Detection Rate : 0.8561         
##    Detection Prevalence : 0.9308         
##       Balanced Accuracy : 0.6631         
##                                          
##        'Positive' Class : no             
## 
print(tuned_rf_importance)
##           MeanDecreaseGini
## age              949.72548
## job              413.39190
## marital          179.07098
## education        182.70440
## default           14.65688
## balance         1112.55389
## housing          245.36679
## loan              95.48039
## duration        2356.83377
## campaign         325.41178
## pdays            581.01435
## previous         239.66907

Accuracy slipped slightly to 0.8977, so the additional trees did not improve raw predictive accuracy, although Kappa (0.3958 vs. 0.3863) and balanced accuracy (0.6631 vs. 0.6545) edged up, indicating a marginally better handle on the minority class. Overall, tuning provided minimal gains, suggesting the baseline settings were already near optimal for this dataset; exploring other hyperparameters might still be beneficial, as sketched below.
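One inexpensive option (a sketch, assuming the categorical columns are stored as factors) is tuneRF(), which searches for the mtry value with the lowest out-of-bag error:

# Search for the mtry with the lowest out-of-bag (OOB) error
set.seed(123)
predictors <- train_data[, setdiff(names(train_data), "y")]
mtry_search <- tuneRF(x = predictors, y = train_data$y,
                      ntreeTry = 100,    # trees grown per candidate mtry
                      stepFactor = 1.5,  # factor by which mtry is scaled at each step
                      improve = 0.01)    # minimum relative OOB improvement to continue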

AdaBoost

Experiment 1

Assessing the baseline (default) performance of an AdaBoost model:

# Preparing data
set.seed(123)
train_index <- createDataPartition(bank$y, p = 0.8, list = FALSE)
train_data <- bank[train_index, ]
test_data <- bank[-train_index, ]

# Ensure the target variable 'y' is a factor
train_data$y <- factor(train_data$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))

# Training the AdaBoost model
ada_model <- ada(y ~ ., data = train_data)

# Making predictions
ada_predictions <- predict(ada_model, test_data, type = "response")

# Evaluating the model
ada_results <- confusionMatrix(ada_predictions, test_data$y)

print(ada_results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7423  682
##        yes  211  322
##                                        
##                Accuracy : 0.8966       
##                  95% CI : (0.89, 0.903)
##     No Information Rate : 0.8838       
##     P-Value [Acc > NIR] : 8.251e-05    
##                                        
##                   Kappa : 0.3681       
##                                        
##  Mcnemar's Test P-Value : < 2.2e-16    
##                                        
##             Sensitivity : 0.9724       
##             Specificity : 0.3207       
##          Pos Pred Value : 0.9159       
##          Neg Pred Value : 0.6041       
##              Prevalence : 0.8838       
##          Detection Rate : 0.8593       
##    Detection Prevalence : 0.9383       
##       Balanced Accuracy : 0.6465       
##                                        
##        'Positive' Class : no           
## 

The accuracy (0.8966) is similar to Random Forest, indicating strong overall performance. The Kappa of 0.3681 and a sensitivity of 0.9724 combined with a low specificity (0.3207) suggest that while the model is excellent at identifying the "no" class, it misses most "yes" cases.
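One possible remedy (a sketch with an illustrative threshold that would need tuning on a validation set) is to score class probabilities and lower the cutoff for predicting "yes":

# Score class probabilities; predict.ada returns one column per class level,
# so column 2 corresponds to "yes" given levels c("no", "yes")
ada_probs <- predict(ada_model, test_data, type = "probs")

# Lower the "yes" threshold from the implicit 0.5 to trade sensitivity for specificity
threshold <- 0.3  # illustrative value only
thresh_predictions <- factor(ifelse(ada_probs[, 2] > threshold, "yes", "no"),
                             levels = c("no", "yes"))
confusionMatrix(thresh_predictions, test_data$y)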

Experiment 2

Determining the effects of increasing the number of boosting iterations on AdaBoost’s accuracy and specificity.

# Setting hyperparameters
iter_values <- 50  # Number of boosting iterations (note that 50 is also ada's default)

# Training the AdaBoost model with tuned parameters
tuned_ada_model <- ada(y ~ ., data = train_data, iter = iter_values)

# Making predictions
tuned_ada_predictions <- predict(tuned_ada_model, test_data, type = "response")

# Evaluating the model
tuned_ada_results <- confusionMatrix(as.factor(tuned_ada_predictions), as.factor(test_data$y))

print(tuned_ada_results)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7426  686
##        yes  208  318
##                                           
##                Accuracy : 0.8965          
##                  95% CI : (0.8899, 0.9029)
##     No Information Rate : 0.8838          
##     P-Value [Acc > NIR] : 9.476e-05       
##                                           
##                   Kappa : 0.3649          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9728          
##             Specificity : 0.3167          
##          Pos Pred Value : 0.9154          
##          Neg Pred Value : 0.6046          
##              Prevalence : 0.8838          
##          Detection Rate : 0.8597          
##    Detection Prevalence : 0.9391          
##       Balanced Accuracy : 0.6447          
##                                           
##        'Positive' Class : no              
## 

The accuracy remained almost unchanged at 0.8965, which is unsurprising: iter = 50 matches ada's default, so this model is essentially a refit of the baseline rather than a larger ensemble. Kappa showed a slight decrease to 0.3649, suggesting that the model's agreement beyond chance is stable but not improved. Testing the hypothesis properly requires an iteration count well above the default, as sketched below.
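A sketch of a genuinely larger ensemble (not run here; training time grows roughly linearly with iter):

# AdaBoost with four times the default number of boosting iterations
bigger_ada_model <- ada(y ~ ., data = train_data, iter = 200)
bigger_predictions <- predict(bigger_ada_model, test_data, type = "response")
confusionMatrix(bigger_predictions, test_data$y)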

I’ll create a chart visualizing the accuracy from each experiment for better comparative analysis.

# Collect the accuracy from each experiment for comparison
results <- data.frame(
  Algorithm = c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "AdaBoost", "AdaBoost"),
  Experiment = c("Baseline", "Tuned", "Baseline", "Tuned", "Baseline", "Tuned"),
  Accuracy = c(0.8255, 0.8903, 0.8992, 0.8977, 0.8966, 0.8965)
)

# Plotting the results
library(ggplot2)
ggplot(results, aes(x = Experiment, y = Accuracy, fill = Algorithm)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  ggtitle("Accuracy Across Different Experiments") +
  xlab("Experiment Type") +
  ylab("Accuracy") +
  theme_minimal()

The Decision Tree shows a marked increase in accuracy when tuned; its baseline accuracy is the lowest of the six runs, though it was measured on a balanced test set and is not directly comparable to the others. Random Forest exhibits high accuracy in both experiments, achieving the best overall result at baseline (0.8992) with only a slight dip when tuned; this suggests it is robust to overfitting, given its ensemble nature, and performs well even without extensive tuning.

AdaBoost is the most stable, with near-identical accuracy across both runs, which is expected since the "tuned" iteration count matched the default. Overall, Random Forest and AdaBoost show high robustness, evidenced by their consistent performance across settings, making them reliable for various applications; both are less sensitive to overfitting than a single Decision Tree.