knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Question 3.1

To find the best classifier among KNN and SVM models, I evaluate each of them in two ways: (a) cross-validation and (b) a train/validation/test split, for KNN and for linear and non-linear SVMs. The accuracy results are collected in a summary table in part (c) Result Discussion at the end of this section (Question 3.1).

The following parts walk through my step-by-step code for these models.

a. Cross-validation (CV)

a.1 K-Nearest-Neighbors Model

Step 1. Install and load the kknn package. Then import the dataset, set the seed for reproducibility, and treat the 11th attribute as the response. I randomly assign 80% of the data points to cross-validation and hold out the remaining 20% for testing.

library(kknn)
data <- read.table("~/Dropbox/Nga_GA/ISYE6501/Homework2_ISYE6501/data3.1/credit_card_data.txt", header = FALSE)
compare_result <- data.frame(Model = character(), 
                            Validation = numeric(), 
                            Test = numeric())
                        

set.seed(123)
data$V11 <- as.factor(data$V11)
n <- nrow(data)
cv_idx <- sample(1:n, floor(0.8 * n)) # floor() makes the 80% sample size an explicit integer
cv_data <- data[cv_idx, ]
test_data  <- data[-cv_idx, ]
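As a quick sanity check (not strictly required), we can confirm that the random split left a similar class balance in both subsets:

prop.table(table(cv_data$V11))   # class proportions in the cross-validation set
prop.table(table(test_data$V11)) # class proportions in the test set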

Step 2. I set the number of folds for cross-validation and a list of k values to iterate over. Details are in the in-code comments.

k_folds <- 10 #Divide dataset into 10 folds
folds <- sample(rep(1:k_folds, length.out = nrow(cv_data))) #Randomly assign each row of the cross-validation data to one of the 10 folds
k_vals <- 1:30 #Create a list of k values from 1 to 30
acc_cv <- numeric(length(k_vals)) #the average accuracy for each k value

Step 3. Create the nested loop for cross-validation. For each k value, I iterate over the folds, using each fold in turn for validation while training the model on the remaining 9 folds. This process repeats 10 times for each k value, yielding an average accuracy acc_cv for each k, which is then plotted in Figure 1 for visualization.

for (k in k_vals) {
  acc_per_fold <- numeric(k_folds)
  
  for (fold in 1:k_folds) {
    validate_idx <- which(folds == fold)
    train_data <- cv_data[-validate_idx, ]
    validate_data <- cv_data[validate_idx, ]
    
    knn_model <- kknn(V11 ~ ., train = train_data, test = validate_data, 
                      k = k, kernel = 'rectangular', scale=TRUE)
    pred <- predict(knn_model, type= "raw")
    acc_per_fold[fold] <- mean(pred == validate_data$V11) #Getting the accuracy for each fold
  }
  
  acc_cv[k] <- mean(acc_per_fold) #Taking accuracy for each k value by averaging out the accuracy of all folds
}

plot(k_vals, acc_cv, type = "b", col = "blue", pch = 16,
     xlab = "k", ylab = "Average CV Accuracy",
     main = "Figure 1. Cross-Validated Accuracy vs k")

Step 4. Find the best k value based on the above accuracy results across k values.

best_k <- which.max(acc_cv)
cat("Best k from CV:", best_k, "with accuracy:", round(acc_cv[best_k], 4), "\n")
## Best k from CV: 7 with accuracy: 0.8548
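Note that which.max() returns the first maximum, so if several k values tie, the smallest wins; one could instead break ties toward larger k for a smoother decision boundary (an optional design choice, sketched here):

tied_k <- k_vals[acc_cv == max(acc_cv)] # all k values achieving the best CV accuracy
best_k_smooth <- max(tied_k)            # optional: prefer the largest tied k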

Step 5. Use the best k value as the parameter for the final model, then predict and evaluate on the untouched 20% test set.

final_knn_model <- kknn(V11 ~ ., train = cv_data, test = test_data, 
                        k = best_k, kernel = 'rectangular', scale=TRUE) #Use best k for final KNN model
test_pred_knn <- predict(final_knn_model, type= "raw") #Predict on test set
test_accuracy_knn <- mean(test_pred_knn == test_data$V11) #Compute test accuracy
compare_result <- rbind(compare_result, 
                        data.frame(Model = "CV KNN", 
                                   Validation = max(acc_cv), 
                                   Test = test_accuracy_knn))
cat("Test set accuracy using k =", best_k, "is", round(test_accuracy_knn, 4), "\n")
## Test set accuracy using k = 7 is 0.8015
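Beyond overall accuracy, a confusion matrix (an extra check, using base R's table()) shows how the errors split between the two classes:

table(Predicted = test_pred_knn, Actual = test_data$V11)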

a.2 Linear and Non-Linear SVM Models

I also built SVM models using cross-validation, curious whether they show any difference or improvement compared with the cross-validated KNN result above.

Step 1. Load kernlab, the SVM package. The dataset import is repeated (commented out) in case a reviewer runs only this code chunk without the ones above.

#install.packages("kernlab")
library(kernlab)
 
#data <- read.table("~/Dropbox/Nga_GA/ISYE6501/Homework2_ISYE6501/data3.1/credit_card_data.txt", header = FALSE)

Step 2. After setting the random seed for reproducibility, I divided the dataset into two parts: 80% for cross-validation and the remaining 20% for testing. I listed one linear kernel (vanilladot) and one non-linear kernel (rbfdot) to compare different types of SVM models, and tried a range of C values to fine-tune the parameter combination by comparing accuracies, which are reported in the summary data frame.

set.seed(234)
data$V11 <- as.factor(data$V11)
n <- nrow(data)
cv_idx <- sample(1:n, floor(0.8 * n)) # floor() makes the 80% sample size an explicit integer
cv_data <- data[cv_idx, ]
test_data  <- data[-cv_idx, ]

kernels <- c("vanilladot", "rbfdot") # Build SVM with different kernels
C_values <- c(0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000) #Examples of C values to test 

results <- data.frame(Kernel = character(), C = numeric(), Accuracy = numeric())

Step 3. I divided the cross-validation set into 10 folds, then looped over each kernel, each C value, and each fold to obtain the average accuracy. The model is trained on the training data train_data and validated on validate_data. The accuracies across the different kernels and C values are sorted and reported in the table results.

k_folds <- 10 #Divide dataset into 10 folds
folds <- sample(rep(1:k_folds, length.out = nrow(cv_data))) #Randomly assign each row of the cross-validation data to one of the 10 folds

for (j in seq_along(kernels)) { 
  for (i in seq_along(C_values)) {
    acc_per_fold <- numeric(k_folds)
    for (fold in 1:k_folds) {
      validate_idx <- which(folds == fold)
      train_data <- cv_data[-validate_idx, ]
      validate_data <- cv_data[validate_idx, ]
      model_ksvm <- ksvm(as.matrix(train_data[,1:10]), as.factor(train_data[,11]),
                         type = "C-svc", kernel = kernels[j], C = C_values[i], scaled = TRUE)

      pred <- predict(model_ksvm, validate_data[,1:10])

      acc_per_fold[fold] <- mean(pred == validate_data$V11) # Accuracy on this fold
    }

    acc_cv <- mean(acc_per_fold) # Average accuracy across folds for this kernel/C pair

    # Add to results
    results <- rbind(results,
                     data.frame(Kernel = kernels[j],
                                C = C_values[i],
                                Accuracy = acc_cv))
  }
}
##  Setting default kernel parameters
results_sorted <- results[order(-results$Accuracy), ] 
results_sorted
##        Kernel     C  Accuracy
## 3  vanilladot 1e-02 0.8604862
## 4  vanilladot 1e-01 0.8604862
## 5  vanilladot 1e+00 0.8604862
## 6  vanilladot 1e+01 0.8604862
## 7  vanilladot 1e+02 0.8604862
## 8  vanilladot 1e+03 0.8604862
## 9  vanilladot 1e+04 0.8604862
## 15     rbfdot 1e+00 0.8585994
## 14     rbfdot 1e-01 0.8547896
## 10 vanilladot 1e+05 0.8529753
## 16     rbfdot 1e+01 0.8299710
## 2  vanilladot 1e-03 0.8070029
## 17     rbfdot 1e+02 0.7956459
## 18     rbfdot 1e+03 0.7804427
## 19     rbfdot 1e+04 0.7670900
## 20     rbfdot 1e+05 0.7554790
## 1  vanilladot 1e-04 0.5428157
## 11     rbfdot 1e-04 0.5428157
## 12     rbfdot 1e-03 0.5428157
## 13     rbfdot 1e-02 0.5428157
best_result <- results[which.max(results$Accuracy), ]

Step 4. Based on the best result (highest accuracy) from Step 3, the parameter combination of vanilladot and C = 0.01, I retrained the model on all of cv_data and then tested it on test_data. The test-set accuracy is reported below:

final_model <- ksvm(as.matrix(cv_data[,1:10]), as.factor(cv_data[,11]),
                    type = "C-svc", kernel = best_result$Kernel, 
                    C = best_result$C, scaled = TRUE)
##  Setting default kernel parameters
test_pred_svm <- predict(final_model, test_data[,1:10])
test_accuracy_svm <- mean(test_pred_svm == test_data$V11)

compare_result <- rbind(compare_result, 
                        data.frame(Model = "CV SVM", 
                                   Validation = best_result$Accuracy, 
                                   Test = test_accuracy_svm
                                   ))

cat("Test accuracy using kernel = '",best_result$Kernel,"' and C = ",best_result$C, "is: ", round(test_accuracy_svm, 4), "\n")
## Test accuracy using kernel = ' vanilladot ' and C =  0.01 is:  0.8702

b. Splitting the data into 60% for training, 25% for validation and 15% for testing

For both the KNN and SVM models, I use the same data split, created in Step 1 below.

set.seed(456)  # for reproducibility
n <- nrow(data)
data$V11 <- as.factor(data$V11)

splits <- sample(c("train", "validate", "test"), size = n, replace = TRUE, 
                 prob = c(0.6, 0.25, 0.15)) # Randomly assign each row to one of the three parts

train_split <- data[splits == "train", ]
validate_split <- data[splits == "validate", ]
test_split <- data[splits == "test", ]
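Because sample() draws each row's label independently with the given probabilities, the realized proportions are only approximately 60/25/15; a quick check confirms them:

round(prop.table(table(splits)), 3) # realized share of train / validate / test rows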

b.1 K-Nearest-Neighbors Model

Step 2. I train my KNN model on the training set only, tuning the k value on the validation set.

k_vals <- 1:30
acc_val <- numeric(length(k_vals))

for (k in k_vals) {
  split_knn_model <- kknn(V11 ~ ., train = train_split, test = validate_split, 
                k = k, kernel = "rectangular", scale = TRUE)

  pred <- predict(split_knn_model, type = "raw")
  acc_val[k] <- mean(pred == validate_split$V11)
}
# Choose best k
best_k_split <- which.max(acc_val)
cat("Best k based on validation:", best_k_split, "with accuracy:", 
    round(acc_val[best_k_split], 4), "\n")
## Best k based on validation: 7 with accuracy: 0.8497

I also plotted the validation accuracy across different k values for better visualization:

plot(k_vals, acc_val, type = "b", pch = 16, col = "blue",
     xlab = "k", ylab = "Validation Accuracy",
     main = "Figure 2. KNN Accuracy on Validation Set")

Step 3. Using the best k value, I retrain the model on the combined training + validation subset, then use this retrained model to predict on the test set.

# Combine training and validation sets
trainval_split <- rbind(train_split, validate_split)

# Train final model with best k
final_split_knn_model <- kknn(V11 ~ ., 
                    train = trainval_split, 
                    test = test_split, 
                    k = best_k_split, 
                    kernel = "rectangular", 
                    scale = TRUE)

test_pred <- predict(final_split_knn_model, type = "raw")
test_acc_split_knn  <- mean(test_pred == test_split$V11)

compare_result <- rbind(compare_result, 
                        data.frame(Model = "Split KNN", 
                                   Validation = acc_val[best_k_split], 
                                   Test = test_acc_split_knn
                                   ))

cat("Final split KNN model accuracy on test set with k =", best_k_split, "is", round(test_acc_split_knn, 4), "\n")
## Final split KNN model accuracy on test set with k = 7 is 0.9091

b.2 SVM Model

Step 2. I trained both the linear vanilladot and the non-linear rbfdot SVM models on the training set, tuning the C value (over the grid from 1e-04 to 1e+05 below) and choosing the best kernel (either vanilladot or rbfdot).

results_split <- data.frame(Kernel = character(), C = numeric(), Accuracy = numeric())
kernels <- c("vanilladot", "rbfdot") # Build SVM with different kernels
C_values <- c(0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000) #Examples of C values to test 

for (j in seq_along(kernels)) { 
  for (i in seq_along(C_values)) {
    model_split_ksvm <- ksvm(as.matrix(train_split[,1:10]), as.factor(train_split[,11]),
                             type = "C-svc", kernel = kernels[j], C = C_values[i], scaled = TRUE)

    pred <- predict(model_split_ksvm, validate_split[,1:10])

    acc_split_val <- mean(pred == validate_split$V11) # Validation accuracy for this kernel/C pair

    # Add to results
    results_split <- rbind(results_split,
                           data.frame(Kernel = kernels[j],
                                      C = C_values[i],
                                      Accuracy = acc_split_val))
  }
}
##  Setting default kernel parameters
results_split_sorted <- results_split[order(-results_split$Accuracy), ] 
results_split_sorted
##        Kernel     C  Accuracy
## 3  vanilladot 1e-02 0.8670520
## 4  vanilladot 1e-01 0.8670520
## 5  vanilladot 1e+00 0.8670520
## 6  vanilladot 1e+01 0.8670520
## 7  vanilladot 1e+02 0.8670520
## 8  vanilladot 1e+03 0.8670520
## 9  vanilladot 1e+04 0.8670520
## 10 vanilladot 1e+05 0.8670520
## 14     rbfdot 1e-01 0.8612717
## 15     rbfdot 1e+00 0.8612717
## 16     rbfdot 1e+01 0.8323699
## 18     rbfdot 1e+03 0.7687861
## 19     rbfdot 1e+04 0.7630058
## 20     rbfdot 1e+05 0.7630058
## 17     rbfdot 1e+02 0.7398844
## 2  vanilladot 1e-03 0.6936416
## 1  vanilladot 1e-04 0.5491329
## 11     rbfdot 1e-04 0.5491329
## 12     rbfdot 1e-03 0.5491329
## 13     rbfdot 1e-02 0.5491329
best_split_svm <- results_split[which.max(results_split$Accuracy), ]
print(best_split_svm)
##       Kernel    C Accuracy
## 3 vanilladot 0.01 0.867052

Step 3. Now I use the linear SVM with kernel = "vanilladot" and C = 0.01 to retrain the model on the combined training + validation data before testing on the test set.

final_split_ksvm_model <- ksvm(as.matrix(trainval_split[,1:10]),
                               as.factor(trainval_split[,11]),
                               type = "C-svc", kernel = best_split_svm$Kernel,
                               C = best_split_svm$C, scaled = TRUE)
##  Setting default kernel parameters
test_pred_split_svm <- predict(final_split_ksvm_model, test_split[,1:10])
test_accuracy_split_svm <- mean(test_pred_split_svm == test_split$V11)

compare_result <- rbind(compare_result, 
                        data.frame(Model = "Split SVM", 
                                   Validation = best_split_svm$Accuracy, 
                                   Test = test_accuracy_split_svm
                                   ))

cat("Test accuracy using kernel = '",best_split_svm$Kernel,"' and C = ", 
    best_split_svm$C, "is: ", round(test_accuracy_split_svm, 4), "\n")
## Test accuracy using kernel = ' vanilladot ' and C =  0.01 is:  0.8909

c. Result Discussion

The accuracy results of the 4 models above are shown below. The best-performing classifier is the KNN model built by splitting the data set into three separate parts: 60% for training, 25% for validation and 15% for testing. Its accuracy on the test set is 90.91% with k = 7.

Comparing validation accuracies, the cross-validated linear SVM seemed to perform best, but its test-set accuracy turned out to be only third. This suggests over-optimism: the validation accuracy is the very quantity the parameters were tuned to maximize, so it tends to overestimate performance on unseen data. By splitting the data set and exposing the model to the validation subset only for parameter tuning, not for training, we can limit this over-optimism to some extent, which is consistent with the better test accuracies of the Split models.

compare_result
##       Model Validation      Test
## 1    CV KNN  0.8547896 0.8015267
## 2    CV SVM  0.8604862 0.8702290
## 3 Split KNN  0.8497110 0.9090909
## 4 Split SVM  0.8670520 0.8909091
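To see the over-optimism numerically, we can add the validation-minus-test gap to the existing table (a small sketch using the compare_result data frame built above):

compare_result$Gap <- compare_result$Validation - compare_result$Test # positive gap = validation estimate was optimistic
compare_result[order(-compare_result$Gap), ]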

Question 4.1

When I worked at a cafe, we wanted to categorize customers' profiles into groups so that our marketing strategies could be more specifically targeted and hence more likely to succeed. However, we did not pre-define those groups, so a clustering model would be suitable in this case. The predictors we would use are:

- age
- occupation
- sales channel (in-person, via app, or on web)
- mode of consumption (in-store, pickup, or delivery)
- timestamp of the transaction/sale
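Although we never implemented it, a minimal sketch of how such mixed predictors could be prepared for k-means might look as follows (the data frame and its values are invented for illustration):

customers <- data.frame(
  age     = c(23, 35, 41, 29, 52, 38),
  channel = c("app", "web", "in-person", "app", "in-person", "web"),
  mode    = c("pickup", "delivery", "in-store", "pickup", "in-store", "delivery")
)
X <- model.matrix(~ age + channel + mode, data = customers)[, -1] # one-hot encode the categorical predictors, drop the intercept
km_cafe <- kmeans(scale(X), centers = 2, nstart = 20) # scale so age does not dominate the 0/1 dummies
km_cafe$cluster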

Question 4.2 K-Means Clustering of Iris data set

To find the best K-means clustering of our data points, I divide my modelling process into 3 steps: (1) find the optimal k, (2) find the best combination of predictors, and (3) cross-check the result from (2) against the confusion matrices and a clustering visualization.

Using the elbow method to find the optimal clustering number k

Before any modelling, I import the data set and scale the first 4 variables (all except the last attribute, which is the label). I then load the relevant clustering packages and set the seed for reproducibility. To find the best number of clusters k (the centers argument of the kmeans() function), I iterate over k from 1 to 5, run the kmeans model, and compare the total within-cluster sum of squares ttwss = tot.withinss. I then plot ttwss against k, which shows that for k > 3 the ttwss does not decrease as sharply as for k <= 3. Result: based on this observation, the optimal k should be 3.

rm(list = ls())
dt <- read.table("/Users/nganguyen/Dropbox/Nga_GA/ISYE6501/Homework2_ISYE6501/data4.2/iris.txt", header = TRUE)
dt_scaled <- scale(dt[1:4])

#install.packages("tidyverse")
#install.packages("cluster")
#install.packages("factoextra")
library(tidyverse)
library(cluster)
library(factoextra)


n_clusters <- 5
ttwss <- numeric(n_clusters)

# Elbow method to choose the optimal number of clusters
set.seed(42)
for (i in 1:n_clusters) {
  km_model <- kmeans(dt_scaled, centers = i, nstart = 20)
  ttwss[i] = km_model$tot.withinss
}

ttwss_df <- tibble(clusters = 1:n_clusters, ttwss = ttwss)

elbow_plot = ggplot(ttwss_df, aes(x = clusters, y = ttwss, group = 1)) + 
  geom_point(size = 4)+
  geom_line()+
  scale_x_continuous(breaks = c(1, 2, 3, 4, 5)) +
  xlab("Number of clusters") +
  ylab("Total Within-Cluster Sum of Squares")
  
elbow_plot
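Since factoextra is already loaded, its fviz_nbclust() function gives a one-line alternative view of the same elbow curve:

fviz_nbclust(dt_scaled, kmeans, method = "wss", k.max = 5) # total within-cluster sum of squares vs number of clusters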

Finding the best combination of predictors

Because there are 4 variables available for kmeans clustering, it is possible to loop through all combinations of them: combinations of 1, 2, 3 and 4 variables. For each combination I run the model, compute the accuracy for comparison, and store all results in a list for later visualization. The accuracy is calculated by matching cluster labels to true labels with the Hungarian algorithm, via the solve_LSAP() function in package clue.
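To make the label-matching step concrete, here is a tiny toy illustration (the matrix values are invented): solve_LSAP() picks the cluster-to-label assignment that maximizes the total count on the matched cells.

library(clue) # also loaded in the full code below
toy_tab <- matrix(c(0, 10,  0,
                    9,  0,  1,
                    0,  2,  8), nrow = 3, byrow = TRUE) # rows = clusters, columns = true labels
solve_LSAP(toy_tab, maximum = TRUE) # optimal matching: cluster 1 -> label 2, 2 -> 1, 3 -> 3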

Result: the best predictors are Petal.Width (accuracy 96%), Petal.Length (accuracy 94.67%) or the combination of those 2 (accuracy 96%).

#install.packages("combinat")
#install.packages("clue")
library(combinat) 
library(clue)   

labels <- dt[, 5]  # true class labels

# Encode labels as numeric
true_labels <- as.numeric(as.factor(labels))

combi_results <- list()
all_cols <- colnames(dt_scaled)

for (k in 1:4) {
  combi <- combn(all_cols, k, simplify = FALSE)
  for (vars in combi) {
    subset <- dt_scaled[, vars, drop = FALSE]
    set.seed(421)
    km <- kmeans(subset, centers = 3, nstart = 20)
    
    tab <- table(km$cluster, true_labels)
    mapping <- solve_LSAP(tab, maximum = TRUE)
    mapped <- as.numeric(mapping[km$cluster])
    acc <- mean(mapped == true_labels)
    
    key <- paste(vars, collapse = ", ")
    combi_results[[key]] <- list(
      variables = vars,
      accuracy = round(acc, 4),
      km_model = km,
      mapped_clusters = mapped
    )
  }
}

# Create summary data frame
combi_df <- data.frame(
  Variables = names(combi_results),
  Accuracy = sapply(combi_results, function(x) x$accuracy),
  row.names = NULL
)

combi_df_sorted <- combi_df[order(-combi_df$Accuracy), ]
print(combi_df_sorted)
##                                               Variables Accuracy
## 4                                           Petal.Width   0.9600
## 10                            Petal.Length, Petal.Width   0.9600
## 3                                          Petal.Length   0.9467
## 13              Sepal.Length, Petal.Length, Petal.Width   0.8667
## 14               Sepal.Width, Petal.Length, Petal.Width   0.8600
## 7                             Sepal.Length, Petal.Width   0.8333
## 15 Sepal.Length, Sepal.Width, Petal.Length, Petal.Width   0.8333
## 9                              Sepal.Width, Petal.Width   0.8200
## 6                            Sepal.Length, Petal.Length   0.8067
## 11              Sepal.Length, Sepal.Width, Petal.Length   0.8067
## 12               Sepal.Length, Sepal.Width, Petal.Width   0.8067
## 5                             Sepal.Length, Sepal.Width   0.7733
## 8                             Sepal.Width, Petal.Length   0.7667
## 1                                          Sepal.Length   0.7067
## 2                                           Sepal.Width   0.5600

Cross-checking with confusion matrix and visualizing the clustering

This last step takes a look at the confusion matrices of the top 3 combinations of predictors. Based on the petal measurements alone, width or length, our kmeans model clusters the iris class setosa perfectly. Looking at the plots, we can easily observe that setosa sits quite far from, and well separated from, the rest of the data points. For the other classes, versicolor and virginica, our kmeans model mislabels a few data points located at the boundary between clusters 2 and 3.

library(ggplot2)
library(gridExtra)

top_combos <- head(combi_df_sorted$Variables, 3)
x_axis <- "Petal.Length"
y_axis <- "Petal.Width"
plot_list <- list()

for (name in top_combos) {
  res <- combi_results[[name]]  # already stored result
  # Print confusion matrix
  cat("\nConfusion matrix for:", name, "\n")
  print(table(Cluster = res$mapped_clusters, TrueLabel = labels))
  
  # Use Petal.Length and Petal.Width from dt_scaled for plotting
  plot_df <- as.data.frame(dt_scaled[, c(x_axis, y_axis)])
  plot_df$Cluster <- factor(res$mapped_clusters)
  plot_df$TrueLabel <- factor(true_labels)
  
  p <- ggplot(plot_df, aes(x = .data[[x_axis]], y = .data[[y_axis]])) +
    geom_point(aes(color = Cluster, shape = TrueLabel), size = 3, alpha = 0.7) +
    labs(
      title = paste0("Clustered using: ", name, "\nAccuracy: ", sprintf("%.2f%%", 100 * res$accuracy))
    ) +
    coord_fixed() +
    theme_minimal()
  
  plot_list[[length(plot_list) + 1]] <- p
}
## 
## Confusion matrix for: Petal.Width 
##        TrueLabel
## Cluster setosa versicolor virginica
##       1     50          0         0
##       2      0         48         4
##       3      0          2        46
## 
## Confusion matrix for: Petal.Length, Petal.Width 
##        TrueLabel
## Cluster setosa versicolor virginica
##       1     50          0         0
##       2      0         48         4
##       3      0          2        46
## 
## Confusion matrix for: Petal.Length 
##        TrueLabel
## Cluster setosa versicolor virginica
##       1     50          0         0
##       2      0         48         6
##       3      0          2        44
for (p in plot_list) {
  print(p)
}