Question 3.1 - Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier:
(a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and
(b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional).
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.1
library(readr)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(kknn)
## Warning: package 'kknn' was built under R version 4.5.1
cc_data_headers <- read.delim("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 2/Homework2_ISYE6501-1/Homework2_ISYE6501/data 3.1/credit_card_data-headers.txt", sep = "\t", header = TRUE)
cc_data <- read.delim("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 2/Homework2_ISYE6501-1/Homework2_ISYE6501/data 3.1/credit_card_data.txt", sep = "\t", header = TRUE)
head(cc_data_headers)
head(cc_data)
To begin our analysis, we want to split our data into three different sets: training, validation, and testing. To do this, we first set a random seed (12345). This ensures that we get the same values every time we run the code; otherwise, our analysis would produce different results on every run.
We place 60% of all the observations in our credit card data into the training set. The remaining 40% is then split in half, using the same sampling approach, into validation and testing sets. Thus, our final split is 60% training, 20% validation, and 20% testing.
#set up seeds
set.seed(12345)
#60% of the number of rows in the data
round(nrow(cc_data_headers)*.6)
## [1] 392
#let us create 60% of our rows
cc_training_rows = sample(nrow(cc_data_headers), round(nrow(cc_data_headers)*.6))
#now create the training set of these observations
cc_training = cc_data_headers[cc_training_rows,]
#now let us create a set of our remaining data for validation and testing
val_and_test = cc_data_headers[-cc_training_rows,]
#Now, let us split this data in half so our final numbers are 60% Training, 20% Validation and 20% Testing
set.seed(54321)
nrow(val_and_test)/2
## [1] 131
validation_rows = sample(nrow(val_and_test), round(nrow(val_and_test)/2))
cc_validation = val_and_test[validation_rows,]
cc_testing = val_and_test[-validation_rows,]
#Now we should have created all three data sets, and we can begin to use this for our model.
Now that we have split our data into training, validation, and testing sets, we can use the validation set to select our best model. We train seven SVM models on the training data, each with a different cost value (C). Then we predict on the validation data to determine which cost value to use. Once we have selected that model, we measure its accuracy using our testing data.
#Let us now begin by validating some SVM models and then some KNN models. We will try 7 different values of C and 12 different values of k.
c_values = c(.001, .01, .1, 1, 10, 100, 1000)
#One other note: for this exercise we are just going to use the vanilladot (linear) kernel.
#We are reusing our code from Homework 1.
results_svm = rep(0,length(c_values))
for (i in 1:length(c_values)){
model <- ksvm(as.matrix(cc_training[,1:10]), # predictors (first 10 columns)
as.factor(cc_training[,11]), # target (11th column)
type = "C-svc", # classification
kernel = "vanilladot", # linear kernel
C = c_values[i], # cost parameter
scaled = TRUE) # scale predictors
#print(c_values[i])
pred <- predict(model, cc_validation[,1:10])
results_svm[i] = mean(pred == cc_validation[,11])
#print(results_svm[i])
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
results_svm
## [1] 0.7480916 0.8625954 0.8625954 0.8625954 0.8625954 0.8625954 0.8625954
#let us now determine the best c value
best_c_value = c_values[which.max(results_svm)]
best_c_value
## [1] 0.01
#Now let us use our test set on the model where we are using the best c_value
svm_mod_final <- ksvm(as.matrix(cc_training[,1:10]), # predictors (first 10 columns)
as.factor(cc_training[,11]), # target (11th column)
type = "C-svc", # classification
kernel = "vanilladot", # linear kernel
C = best_c_value, # cost parameter
scaled = TRUE) # scale predictors
## Setting default kernel parameters
svm_mod_pred <- predict(svm_mod_final, cc_testing[,1:10])
svm_result_final = mean(svm_mod_pred == cc_testing[,11])
svm_result_final
## [1] 0.8320611
We can see from our validation testing that our best cost value is 0.01 (several larger C values tie at ~86.3% validation accuracy; which.max simply returns the first). Using this cost value in our SVM model and predicting against our testing data gave us a testing accuracy of ~83.2%. That is solid for a linear kernel and is similar to our ~86% accuracy from Homework 1. A different, more complex kernel might improve accuracy, though that is not guaranteed.
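If we wanted to check that claim, here is a minimal sketch that reuses the same training/validation split and C values from above but swaps in the "rbfdot" (Gaussian radial basis) kernel, with kernlab estimating sigma automatically via kpar = "automatic". Whether this actually beats the linear kernel is not guaranteed and would still need to be confirmed on the validation and test sets.
#Hedged sketch: same validation loop as above, but with an RBF kernel instead of vanilladot
results_svm_rbf = rep(0, length(c_values))
for (i in 1:length(c_values)){
model_rbf <- ksvm(as.matrix(cc_training[,1:10]), # predictors (first 10 columns)
as.factor(cc_training[,11]), # target (11th column)
type = "C-svc", # classification
kernel = "rbfdot", # Gaussian (RBF) kernel
kpar = "automatic", # let kernlab estimate the sigma parameter
C = c_values[i], # cost parameter
scaled = TRUE) # scale predictors
pred_rbf <- predict(model_rbf, cc_validation[,1:10])
results_svm_rbf[i] = mean(pred_rbf == cc_validation[,11])
}
results_svm_rbf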
Next, we apply the same validation approach to the k-nearest-neighbors algorithm. We evaluate each candidate value of k on the validation data, select the best one, and then measure accuracy using the testing data.
#Now let us do this same process for the knn algorithm where we use validation to find the best value of k.
k_values = c(1:12)
results_knn = rep(0,length(k_values))
for (i in 1:length(k_values)){
knn_mod = kknn(
formula = R1 ~ .,
train = cc_training,
test = cc_validation,
k = k_values[i],
scale = TRUE
)
knn_pred = round(fitted(knn_mod)) # fitted values are weighted averages of neighbor responses; round to 0/1
results_knn[i] = mean(knn_pred == cc_validation[,11])
}
results_knn
## [1] 0.7633588 0.7633588 0.7633588 0.7633588 0.8320611 0.8473282 0.8473282
## [8] 0.8473282 0.8549618 0.8702290 0.8625954 0.8702290
# Now let us find the index of the best k value from our validation testing and use that value to test our accuracy
best_k_value = k_values[which.max(results_knn)]
best_k_value
## [1] 10
knn_mod_final = kknn(
formula = R1 ~ .,
train = cc_training,
test = cc_testing,
k=best_k_value,
scale = TRUE
)
knn_mod_pred = round(fitted(knn_mod_final))
knn_result_final = mean(knn_mod_pred == cc_testing[,11])
knn_result_final
## [1] 0.8167939
According to our validation testing, our best value of k is 10. Using k = 10 in our model and measuring the accuracy on our testing data, we found a correct prediction rate of ~81.7%. This is fairly similar to our accuracy of ~85% in Homework 1. The result makes intuitive sense: we would expect accuracy to be slightly lower when we hold out validation and test sets, because doing so reduces the risk of fitting the model too closely to specific samples of the data.
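It is also worth noting that part (a) of the question asks for cross-validation with KNN. A minimal sketch of that approach, assuming the train.kknn function from the kknn package (which performs leave-one-out cross-validation over a range of k on the training data), might look like the following; the k it selects could differ from the validation-set result above.
#Hedged sketch: leave-one-out cross-validation over k = 1..12 using only the training data
cc_training_factor = cc_training
cc_training_factor$R1 = as.factor(cc_training_factor$R1) # factor response so kknn treats this as classification
knn_cv = train.kknn(R1 ~ ., data = cc_training_factor, kmax = 12, scale = TRUE)
knn_cv$best.parameters$k # k selected by cross-validation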
Question 4.1 - Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.
I believe a good situation where you might want to use a clustering model is if you wanted to categorize different ecological biomes in a given region. Biomes don't necessarily have a concrete, predefined classification, so we can use certain predictors to cluster these different areas together. Here are some predictors we might want to use:
> Average Temperature
> Soil pH Levels
> Annual Precipitation Levels
> Biodiversity Levels
Ultimately, we could use these predictors in a clustering algorithm, such as a k-means clustering algorithm.
Question 4.2 - The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris). The response values are only given to see how well a specific method performed and should not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.
To start our analysis, we want to do a min/max scaling as described in the lectures. This way our clusters aren’t skewed by the differing magnitudes of measurements. This can be represented by the formula (x-min(x))/(max(x)-min(x)).
#Let us now read in our iris file
iris_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 2/Homework2_ISYE6501-1/Homework2_ISYE6501/data 4.2/iris.txt", header = TRUE)
iris_data
#We need to scale the data using a min/max scale = (x-min(x))/(max(x)-min(x))
iris_data_new = iris_data
for (i in 1:4){
iris_data_new[,i] = (iris_data[,i]-min(iris_data[,i]))/(max(iris_data[,i])-min(iris_data[,i]))
}
iris_data = iris_data_new
iris_data
To begin the process, let us get a general idea of how the algorithm behaves with different numbers of centers. In each k-means call we pass in our predictors, set the number of centers, and set nstart, which is how many times we want to rerun the process with different random starting centers. In the following models, I run each configuration 50 times (nstart = 50) and kmeans keeps the "best" run, i.e., the one with the smallest total within-cluster distance.
# let us run some kmeans algorithms from 2 centers to 6 centers
iris_mod_1 = kmeans(iris_data[,1:4], 2, nstart = 50)
distance_1 = 0
for (i in 1:nrow(iris_data)){
distance_1 = distance_1 + dist(rbind(iris_data[i,1:4], iris_mod_1$centers[iris_mod_1$cluster[i],]))
}
distance_1[1]
## [1] 37.18025
iris_mod_2 = kmeans(iris_data[,1:4], 3, nstart = 50)
distance_2 = 0
for (i in 1:nrow(iris_data)){
distance_2 = distance_2 + dist(rbind(iris_data[i,1:4], iris_mod_2$centers[iris_mod_2$cluster[i],]))
}
distance_2[1]
## [1] 29.22428
iris_mod_3 = kmeans(iris_data[,1:4], 4, nstart = 50)
distance_3 = 0
for (i in 1:nrow(iris_data)){
distance_3 = distance_3 + dist(rbind(iris_data[i,1:4], iris_mod_3$centers[iris_mod_3$cluster[i],]))
}
distance_3[1]
## [1] 25.97557
iris_mod_4 = kmeans(iris_data[,1:4], 5, nstart = 50)
distance_4 = 0
for (i in 1:nrow(iris_data)){
distance_4 = distance_4 + dist(rbind(iris_data[i,1:4], iris_mod_4$centers[iris_mod_4$cluster[i],]))
}
distance_4[1]
## [1] 23.6132
iris_mod_5 = kmeans(iris_data[,1:4], 6, nstart = 50)
distance_5 = 0
for (i in 1:nrow(iris_data)){
distance_5 = distance_5 + dist(rbind(iris_data[i,1:4], iris_mod_5$centers[iris_mod_5$cluster[i],]))
}
distance_5[1]
## [1] 22.04799
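As a side note, the five blocks above could be collapsed into a single loop; a sketch is below, under the assumption that we keep the same hand-computed sum of point-to-center distances. (kmeans also reports tot.withinss, the total within-cluster sum of squared distances, which is a common shortcut for the same comparison.)
#Hedged sketch: same computation as above, written as one loop over the number of centers
total_distances = rep(0, 5)
for (k in 2:6){
mod = kmeans(iris_data[,1:4], k, nstart = 50)
for (i in 1:nrow(iris_data)){
total_distances[k-1] = total_distances[k-1] + dist(rbind(iris_data[i,1:4], mod$centers[mod$cluster[i],]))
}
}
total_distances # cumulative distances for k = 2 through 6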
Now that we have computed the cumulative distances for models with 2, 3, 4, 5, and 6 centers, we can create our elbow graph to get a better picture of how many centers we should use.
#Now, let us make our "elbow graph" to get a visualization of the total distance from the cluster centers
distances = c(distance_1[1],distance_2[1] ,distance_3[1],distance_4[1],distance_5[1])
centers = c(2,3,4,5,6)
table_data = data.frame(centers = centers, distances = distances)
table_data
plot(table_data$centers, table_data$distances,
type = "b", # both points and lines
xlab = "Number of Centers (k)",
ylab = "Standardized Distance",
main = "Elbow Plot")
This graph gives us a good look at the drop-off in total distance as the number of centers increases. Now, I want to examine the models with 3, 4, and 5 centers more closely.
#Let us now examine the relationship between the sepals and the petals to see how good our clusters look in each scenario.
#At this point, 2 centers is not really feasible: there are three different species in the data, so 2 doesn't make sense. Thus, let us examine the relationship between Sepal and Petal Lengths/Widths for 3, 4, and 5 centers.
# 3 Centers
table(iris_mod_2$cluster, iris_data$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 47 14
## 3 0 3 36
ggplot(iris_data, aes(Petal.Length, Petal.Width, color = iris_mod_2$cluster)) + geom_point()
ggplot(iris_data, aes(Sepal.Length, Sepal.Width, color = iris_mod_2$cluster)) + geom_point()
# 4 Centers
table(iris_mod_3$cluster, iris_data$Species)
##
## setosa versicolor virginica
## 1 0 27 2
## 2 0 0 29
## 3 0 23 19
## 4 50 0 0
ggplot(iris_data, aes(Petal.Length, Petal.Width, color = iris_mod_3$cluster)) + geom_point()
ggplot(iris_data, aes(Sepal.Length, Sepal.Width, color = iris_mod_3$cluster)) + geom_point()
# 5 Centers
table(iris_mod_4$cluster, iris_data$Species)
##
## setosa versicolor virginica
## 1 0 0 29
## 2 0 27 2
## 3 0 23 19
## 4 28 0 0
## 5 22 0 0
ggplot(iris_data, aes(Petal.Length, Petal.Width, color = iris_mod_4$cluster)) + geom_point()
ggplot(iris_data, aes(Sepal.Length, Sepal.Width, color = iris_mod_4$cluster)) + geom_point()
After evaluating these options, I think our best model is the one with three centers. It looks the best graphically, and its table of cluster assignments versus species makes the most sense, in my opinion. Intuitively, this also makes sense because there are three species to categorize.
#Thus, here is our final table of classifications
table(iris_mod_2$cluster, iris_data$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 47 14
## 3 0 3 36
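To quantify how well this three-cluster solution predicts flower type, we can match each cluster to its majority species and count the agreements: from the table above that is 50 + 47 + 36 = 133 of 150 points, or roughly 88.7%. A minimal sketch of that calculation, using the objects already defined above, is:
#Match each cluster to its majority species and compute the overall agreement rate
conf = table(iris_mod_2$cluster, iris_data$Species)
sum(apply(conf, 1, max)) / nrow(iris_data) # about 0.887 (133 of 150 points)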
For this homework assignment, we used ChatGPT to help navigate the ggplot2 R package. We also used the R documentation for the kmeans algorithm and the rbind function.