Question 3.1 Using the same data set (credit_card_data.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier:

a. using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and

First, I graphed the dependent variable to see the distribution of 0s and 1s. From the graph we can see that most responses were 0 (358).


To perform the cross-validation for the k-nearest-neighbors algorithm, I followed the K-fold cross-validation instructions on the following webpages: 1 and 2. These websites suggested using the caret package instead of ksvm.

I split the data randomly into training+validation and test sets. As suggested by Professor Sokol, I used 80% for training+validation and 20% for testing.

train_index <- createDataPartition(data[, ncol(data)], p = 0.8, list = FALSE) #split data for training and test

train_data <- data[train_index, ]
test_data  <- data[-train_index, ]

train_control <- trainControl(method = "cv", 
                              number = 10) #number of subsets for cross-validaton

model_knn <- train(
  x = train_data[, -ncol(train_data)], # predictors only (dependent variable excluded)
  y = train_data$R1, # dependent variable
  method = "kknn", # model type
  trControl = train_control, # control parameters
  preProcess = c("center", "scale"), # standardize the data
  tuneGrid = expand.grid(
    kmax = 3:15, # number of neighbors to test
    distance = 2, # (2) Euclidean distance
    kernel = "rectangular" # all neighbors weighted equally
  )
)

plot(model_knn)

model_knn$finalModel # print best model in validation
## 
## Call:
## kknn::train.kknn(formula = .outcome ~ ., data = dat, kmax = param$kmax,     distance = param$distance, kernel = as.character(param$kernel))
## 
## Type of response variable: nominal
## Minimal misclassification: 0.1526718
## Best kernel: rectangular
## Best k: 9

The graph shows k (the number of neighbors tested) versus the accuracy of the model. As can be seen, the best model reported is the one with k = 9, which we will use on the test data. We can also print the final model, which corresponds to the one with the highest accuracy.

Finally, we test the best model using the held-out test data. The caret package keeps the best model and uses it in the prediction function. As can be seen, the accuracy of the best model on the test set was about 85%.

It is interesting to mention that, because the training subsets change each time we run the function, the results for the best model also change. This caught my attention, as I wondered how we can be sure about the model we are choosing if randomness also influences the results of each run (see the sketch after the test results below).

pred <- predict(model_knn, test_data[, -ncol(test_data)])
score <- (sum(pred == test_data$R1) / nrow(test_data))
score
## [1] 0.8538462
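
One way to reduce this run-to-run variation is to fix the random seed before splitting and training, or to use repeated cross-validation so accuracy is averaged over several resampling rounds. A minimal sketch of both ideas (not part of the original analysis; it reuses the objects defined above):

set.seed(42) # fix the random seed so the data split and resampling are reproducible

repeat_control <- trainControl(method = "repeatedcv", # 10-fold CV repeated several times
                               number = 10,
                               repeats = 5) # accuracy is averaged over 5 repetitions

model_knn_rep <- train(
  x = train_data[, -ncol(train_data)],
  y = train_data$R1,
  method = "kknn",
  trControl = repeat_control,
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(kmax = 3:15, distance = 2, kernel = "rectangular")
)

Averaging over repeated folds makes the selected k much less dependent on any single random split.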

After running the knn model, I ran the SVM model using the caret package, following the example on this webpage.

model_svm <- train(
  x = train_data[, -ncol(train_data)], # predictors only (dependent variable excluded)
  y = train_data$R1, # dependent variable
  method = "svmRadial", # SVM with a radial basis (Gaussian) kernel
  trControl = train_control,
  preProcess = c("center", "scale"),
  tuneLength = 10) # caret tries 10 values of C, holding sigma at one estimated value

plot(model_svm)

model_svm$bestTune
##       sigma    C
## 1 0.1076356 0.25

The previous graph shows the results of the cross-validation for the SVM model. As can be seen, the highest accuracy is achieved at C = 0.25. Finally, we test the model using the test data and find that its accuracy is about 85%.

When doing cross-validation, both knn and svm performed similarly well.

pred <- predict(model_svm, test_data[, -ncol(test_data)])
score <- (sum(pred == test_data$R1) / nrow(test_data))
score
## [1] 0.8461538
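
Beyond overall accuracy, caret's confusionMatrix() reports the full table plus sensitivity and specificity, which is useful when the two classes are not perfectly balanced. A minimal sketch, assuming R1 is stored as a factor (as it is in the models above):

cm <- confusionMatrix(pred, factor(test_data$R1, levels = levels(pred))) # predictions vs. true labels
cm$table                 # 2 x 2 table of predicted vs. actual classes
cm$overall["Accuracy"]   # same accuracy as computed manually above
cm$byClass[c("Sensitivity", "Specificity")]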

b. splitting the data into training, validation and test data sets (pick either KNN or SVM; the other is optional)

I decided to test the classification model for both kknn and svm. First, I split the data into training, validation and testing. For this case, I chose to split the data into 50% training, and the remaining 50% evenly split between validation and test.

train_index <- createDataPartition(data$R1, p = 0.5, list = FALSE)
train_data <- data[train_index, ]  # Training set (50%)
temp_data <- data[-train_index, ]  # Remaining 50%

val_index <- createDataPartition(temp_data$R1, p = 0.5, list = FALSE) # Partition of remaining data in half
validation_data <- temp_data[val_index, ]  # Validation set (25%)
test_data <- temp_data[-val_index, ] # Test set (25%)

Second, I trained the knn and svm models using the training data. Similar to when performing cross-validation, I tested different values of k (3 to 15) and C and sigma (10 generated instances) for both models. Later, I tested the models using the validation data set.

model_knn <- train(
  x = train_data[, -ncol(train_data)],
  y = train_data$R1,
  method = "kknn",
  preProcess = c("center","scale"),
  tuneGrid = expand.grid(
    kmax = 3:15, # number of neighbors to test
    distance = 2, # using (2) Euclidean distance
    kernel = "rectangular" # all neighbors weight the same
  ) 
)

model_svm <- train(
  x = train_data[, -ncol(train_data)],
  y = train_data$R1,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10
)

pred_val_knn <- predict(model_knn, validation_data[, -ncol(validation_data)])
score_val_knn <- sum(pred_val_knn == validation_data$R1) / nrow(validation_data)

pred_val_svm <- predict(model_svm, validation_data[, -ncol(validation_data)])
score_val_svm <- sum(pred_val_svm == validation_data$R1) / nrow(validation_data)

cat("Accuracy of knn model:", score_val_knn,"\n")
## Accuracy of knn model: 0.5060976
cat("Accuracy of svm model:", score_val_svm)
## Accuracy of svm model: 0.5060976

As both models performed similarly, with roughly 51% accuracy on the validation set, I tested both models with the test data.

pred_test_knn <- predict(model_knn, test_data[, -ncol(test_data)])
score_test_knn <- sum(pred_test_knn == test_data$R1) / nrow(test_data)

pred_test_svm <- predict(model_svm, test_data[, -ncol(test_data)])
score_test_svm <- sum(pred_test_svm == test_data$R1) / nrow(test_data)

cat("Accuracy of knn model:", score_test_knn,"\n")
## Accuracy of knn model: 0.5092025
cat("Accuracy of svm model:", score_test_svm)
## Accuracy of svm model: 0.5092025

As can be seen, both models performed the same on the test data (about 51%), essentially matching their validation accuracy, but well below the roughly 85% achieved with the cross-validation approach.

Question 4.1 Describe a situation from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you may use.

At my home university, my department sometimes offers special discounts to encourage students to join grad school. It could use a clustering model on alumni attributes to group former students and assign a discount level based on the cluster each student belongs to. The department is interested in offering these discounts because it has graduation quotas to meet and also wants alumni who improve the reputation of the institution.

The university would like to offer higher discounts to students who tend to pay on time, have good grades, or have successful careers, among other criteria. In this case, some attributes include (1) undergraduate GPA, (2) income, (3) graduation year (since people who graduated long ago may be more likely to drop out of grad school), and (4) undergraduate degree (to see how related it is to the grad degree they want to pursue).

Question 4.2 The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k and how well your best clustering predicts flower type.

For this question, I followed the instructions on this website to learn about the usage of the iris dataset. Later, I used this website to understand how to use kmeans.

From the summary below we can see that there are three species of flower, which suggests that k = 3 could be the ideal value to use. However, I will test k values ranging from 1 to 10.

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

The following graph shows the number of clusters (k) versus the within-group sum of squares. As can be seen, the elbow forms around k = 2, which suggests this is the ideal k value. This result warrants deeper analysis, since we know there are in fact three flower species, so I decided to compare the results for k = 2 and k = 3.
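
The code used to produce the elbow plot is not shown; a minimal sketch of one way to generate it, assuming the data frame is named iris (as in the summary above) and iris_num holds its four numeric predictor columns:

iris_num <- iris[, 1:4] # the four numeric predictors, with Species dropped

# total within-cluster sum of squares for k = 1 to 10
wss <- sapply(1:10, function(k) kmeans(iris_num, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-group sum of squares")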

model_kmeans_2 <- kmeans(iris_num,centers=2,nstart=25)
model_kmeans_3 <- kmeans(iris_num,centers=3,nstart=25)

model_kmeans_2
## K-means clustering with 2 clusters of sizes 53, 97
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.005660    3.369811     1.560377    0.290566
## 2     6.301031    2.886598     4.958763    1.695876
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  28.55208 123.79588
##  (between_SS / total_SS =  77.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
model_kmeans_3
## K-means clustering with 3 clusters of sizes 62, 50, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.901613    2.748387     4.393548    1.433871
## 2     5.006000    3.428000     1.462000    0.246000
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 3 3 3
## [112] 3 3 1 1 3 3 3 3 1 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1 3
## [149] 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 39.82097 15.15100 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

When analyzing the results for k = 2 and k = 3, we see that the proportion of variance explained by the clustering, given by the ratio of the between-cluster sum of squares to the total sum of squares, is higher for k = 3 (88.4%) than for k = 2 (77.6%). This contradicts my finding on the optimal number of clusters from the elbow plot. To continue analyzing the results, I graphed the clusters, as well as the confusion matrices for both models.
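
The same ratio can be pulled directly from the kmeans objects, as a quick check against the percentages printed above:

model_kmeans_2$betweenss / model_kmeans_2$totss # about 0.776
model_kmeans_3$betweenss / model_kmeans_3$totss # about 0.884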

As can be seen in the graph, when k = 2 there are two clear groups, but when k = 3 two of the clusters overlap. In practice, choosing k also comes from a conceptual analysis: we should have an idea beforehand of a realistic number of clusters for our data set, and in this case we know there are three species.

Finally, the following table shows the overlap mentioned above: some virginica points are grouped into the mostly-versicolor cluster, and a couple of versicolor points into the mostly-virginica cluster. From these results, the best model corresponds to k = 3.

##    
##     setosa versicolor virginica
##   1      0         48        14
##   2     50          0         0
##   3      0          2        36
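
To put a number on how well the k = 3 clustering predicts flower type, each cluster can be mapped to its majority species and the matches counted; a minimal sketch, assuming the data frame iris and model_kmeans_3 from above:

tab <- table(model_kmeans_3$cluster, iris$Species) # same table as shown above
sum(apply(tab, 1, max)) / sum(tab) # fraction of points whose cluster's majority species matches their own

From the table above this works out to (50 + 48 + 36) / 150, or roughly 89% of flowers correctly grouped by species.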