Question 3.1 - Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the ksvm or kknn function to find a good classifier:
(a) using cross-validation (do this for the k-nearest-neighbors model; SVM is optional); and
(b) splitting the data into training, validation, and test data sets (pick either KNN or SVM; the other is optional).
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.1
library(readr)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(kknn)
## Warning: package 'kknn' was built under R version 4.5.1
cc_data_headers <- read.delim("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 2/Homework2_ISYE6501-1/Homework2_ISYE6501/data 3.1/credit_card_data-headers.txt", sep = "\t", header = TRUE)
cc_data <- read.delim("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 2/Homework2_ISYE6501-1/Homework2_ISYE6501/data 3.1/credit_card_data.txt", sep = "\t", header = TRUE)
head(cc_data_headers)
head(cc_data)
To begin our analysis, we want to split our data into three different sets: training, validation, and testing. To do this, we first set a random seed (12345). This ensures that we get the same values every time we run the code; otherwise, our analysis would produce different results on every run.
We place 60% of all the observations in our credit card data into the training set. The remaining 40% is then split in half, using the same sampling approach, into validation and testing sets. Thus, our final split is 60% training, 20% validation, and 20% testing.
#set up seeds
set.seed(12345)
#60% of the number of rows in the data
round(nrow(cc_data_headers)*.6)
## [1] 392
#let us create 60% of our rows
cc_training_rows = sample(nrow(cc_data_headers), round(nrow(cc_data_headers)*.6))
#now create the training set of these observations
cc_training = cc_data_headers[cc_training_rows,]
#now let us create a set of our remaining data for validation and testing
val_and_test = cc_data_headers[-cc_training_rows,]
#Now, let us split this data in half so our final numbers are 60% Training, 20% Validation and 20% Testing
set.seed(54321)
nrow(val_and_test)/2
## [1] 131
validation_rows = sample(nrow(val_and_test), round(nrow(val_and_test)/2))
cc_validation = val_and_test[validation_rows,]
cc_testing = val_and_test[-validation_rows,]
#Now we should have created all three data sets, and we can begin to use this for our model.
Now that we have split our data into training, validation, and testing sets, we can use the validation set to select our best model. We train seven SVM models on the training data, each with a different cost value (C). Then we predict on the validation data to determine which cost value to use. Once we have selected that model, we measure its accuracy using our testing data.
#Let us now begin by validating some SVM models and then some KNN models. We will try 7 different values of C and 12 different values of k.
c_values = c(.001, .01, .1, 1, 10, 100, 1000)
#One other note: for this exercise we are just going to use the vanilladot (linear) kernel.
#We are reusing our code from Homework 1.
results_svm = rep(0,length(c_values))
for (i in 1:length(c_values)){
model <- ksvm(as.matrix(cc_training[,1:10]), # predictors (first 10 columns)
as.factor(cc_training[,11]), # target (11th column)
type = "C-svc", # classification
kernel = "vanilladot", # linear kernel
C = c_values[i], # cost parameter
scaled = TRUE) # scale predictors
#print(c_values[i])
pred <- predict(model, cc_validation[,1:10])
results_svm[i] = mean(pred == cc_validation[,11])
#print(results_svm[i])
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
results_svm
## [1] 0.7480916 0.8625954 0.8625954 0.8625954 0.8625954 0.8625954 0.8625954
#let us now determine the best c value
best_c_value = c_values[which.max(results_svm)]
best_c_value
## [1] 0.01
#Now let us use our test set on the model where we are using the best c_value
svm_mod_final <- ksvm(as.matrix(cc_training[,1:10]), # predictors (first 10 columns)
as.factor(cc_training[,11]), # target (11th column)
type = "C-svc", # classification
kernel = "vanilladot", # linear kernel
C = best_c_value, # cost parameter
scaled = TRUE) # scale predictors
## Setting default kernel parameters
svm_mod_pred <- predict(svm_mod_final, cc_testing[,1:10])
svm_result_final = mean(svm_mod_pred == cc_testing[,11])
svm_result_final
## [1] 0.8320611
We can see from our validation testing that our best cost value is 0.01 (several larger C values tie at ~86.3% validation accuracy; which.max simply returns the first). Using this cost value in our SVM model and predicting against our testing data gave us a testing accuracy of ~83.2%. That is solid for a linear kernel and is similar to our ~86% accuracy from Homework 1. A different, more complex kernel might improve accuracy, though that is not guaranteed.
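If we wanted to check that claim, here is a minimal sketch that reuses the same training/validation split and C values from above but swaps in the "rbfdot" (Gaussian radial basis) kernel, with kernlab estimating sigma automatically via kpar = "automatic". Whether this actually beats the linear kernel is not guaranteed and would still need to be confirmed on the validation and test sets.
#Hedged sketch: same validation loop as above, but with an RBF kernel instead of vanilladot
results_svm_rbf = rep(0, length(c_values))
for (i in 1:length(c_values)){
model_rbf <- ksvm(as.matrix(cc_training[,1:10]), # predictors (first 10 columns)
as.factor(cc_training[,11]), # target (11th column)
type = "C-svc", # classification
kernel = "rbfdot", # Gaussian (RBF) kernel
kpar = "automatic", # let kernlab estimate the sigma parameter
C = c_values[i], # cost parameter
scaled = TRUE) # scale predictors
pred_rbf <- predict(model_rbf, cc_validation[,1:10])
results_svm_rbf[i] = mean(pred_rbf == cc_validation[,11])
}
results_svm_rbf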
Next, we apply the same validation approach to the k-nearest-neighbors algorithm. We evaluate each candidate value of k on the validation data, select the best one, and then measure accuracy using the testing data.
#Now let us do this same process for the knn algorithm where we use validation to find the best value of k.
k_values = c(1:12)
results_knn = rep(0,length(k_values))
for (i in 1:length(k_values)){
knn_mod = kknn(
formula = R1 ~ .,
train = cc_training,
test = cc_validation,
k = k_values[i],
scale = TRUE
)
knn_pred = round(fitted(knn_mod)) # fitted values are weighted averages of neighbor responses; round to 0/1
results_knn[i] = mean(knn_pred == cc_validation[,11])
}
results_knn
## [1] 0.7633588 0.7633588 0.7633588 0.7633588 0.8320611 0.8473282 0.8473282
## [8] 0.8473282 0.8549618 0.8702290 0.8625954 0.8702290
# Now let us find the index of the best k value from our validation testing and use that value to test our accuracy
best_k_value = k_values[which.max(results_knn)]
best_k_value
## [1] 10
knn_mod_final = kknn(
formula = R1 ~ .,
train = cc_training,
test = cc_testing,
k=best_k_value,
scale = TRUE
)
knn_mod_pred = round(fitted(knn_mod_final))
knn_result_final = mean(knn_mod_pred == cc_testing[,11])
knn_result_final
## [1] 0.8167939
According to our validation testing, our best value of k is 10. Using k = 10 in our model and measuring the accuracy on our testing data, we found a correct prediction rate of ~81.7%. This is fairly similar to our accuracy of ~85% in Homework 1. The result makes intuitive sense: we would expect accuracy to be slightly lower when we hold out validation and test sets, because doing so reduces the risk of fitting the model too closely to specific samples of the data.
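It is also worth noting that part (a) of the question asks for cross-validation with KNN. A minimal sketch of that approach, assuming the train.kknn function from the kknn package (which performs leave-one-out cross-validation over a range of k on the training data), might look like the following; the k it selects could differ from the validation-set result above.
#Hedged sketch: leave-one-out cross-validation over k = 1..12 using only the training data
cc_training_factor = cc_training
cc_training_factor$R1 = as.factor(cc_training_factor$R1) # factor response so kknn treats this as classification
knn_cv = train.kknn(R1 ~ ., data = cc_training_factor, kmax = 12, scale = TRUE)
knn_cv$best.parameters$k # k selected by cross-validation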
Question 4.1 - Describe a situation or problem from your job, everyday life, current events, etc., for which a clustering model would be appropriate. List some (up to 5) predictors that you might use.
I believe a good situation where you might want to use a clustering model is if you wanted to categorize different ecological biomes in a given region. Biomes don't necessarily have a concrete, predefined classification, so we can use certain predictors to cluster these different areas together. Here are some predictors we might want to use:
> Average Temperature
> Soil pH Levels
> Annual Precipitation Levels
> Biodiversity Levels
Ultimately, we could use these predictors in a clustering algorithm, such as a k-means clustering algorithm.
Question 4.2 - The iris data set iris.txt contains 150 data points, each with four predictor variables and one categorical response. The predictors are the width and length of the sepal and petal of flowers and the response is the type of flower. The data is available from the R library datasets and can be accessed with iris once the library is loaded. It is also available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris). The response values are only given to see how well a specific method performed and should not be used to build the model.
Use the R function kmeans to cluster the points as well as possible. Report the best combination of predictors, your suggested value of k, and how well your best clustering predicts flower type.
To start our analysis, we want to do a min/max scaling as described in the lectures. This way our clusters aren’t skewed by the differing magnitudes of measurements. This can be represented by the formula (x-min(x))/(max(x)-min(x)).
#Let us now read in our iris file
iris_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 2/Homework2_ISYE6501-1/Homework2_ISYE6501/data 4.2/iris.txt", header = TRUE)
iris_data
#We need to scale the data using a min/max scale = (x-min(x))/(max(x)-min(x))
iris_data_new = iris_data
for (i in 1:4){
iris_data_new[,i] = (iris_data[,i]-min(iris_data[,i]))/(max(iris_data[,i])-min(iris_data[,i]))
}
iris_data = iris_data_new
iris_data
To begin the process, let us get a general idea of how the algorithm behaves with different numbers of centers. In each k-means call we pass in our predictors, set the number of centers, and set nstart, which is how many times we want to rerun the process with different random starting centers. In the following models, I run each configuration 50 times (nstart = 50) and kmeans keeps the "best" run, i.e., the one with the smallest total within-cluster distance.
# let us run some kmeans algorithms from 2 centers to 6 centers
iris_mod_1 = kmeans(iris_data[,1:4], 2, nstart = 50)
distance_1 = 0
for (i in 1:nrow(iris_data)){
distance_1 = distance_1 + dist(rbind(iris_data[i,1:4], iris_mod_1$centers[iris_mod_1$cluster[i],]))
}
distance_1[1]
## [1] 37.18025
iris_mod_2 = kmeans(iris_data[,1:4], 3, nstart = 50)
distance_2 = 0
for (i in 1:nrow(iris_data)){
distance_2 = distance_2 + dist(rbind(iris_data[i,1:4], iris_mod_2$centers[iris_mod_2$cluster[i],]))
}
distance_2[1]
## [1] 29.22428
iris_mod_3 = kmeans(iris_data[,1:4], 4, nstart = 50)
distance_3 = 0
for (i in 1:nrow(iris_data)){
distance_3 = distance_3 + dist(rbind(iris_data[i,1:4], iris_mod_3$centers[iris_mod_3$cluster[i],]))
}
distance_3[1]
## [1] 25.97557
iris_mod_4 = kmeans(iris_data[,1:4], 5, nstart = 50)
distance_4 = 0
for (i in 1:nrow(iris_data)){
distance_4 = distance_4 + dist(rbind(iris_data[i,1:4], iris_mod_4$centers[iris_mod_4$cluster[i],]))
}
distance_4[1]
## [1] 23.6132
iris_mod_5 = kmeans(iris_data[,1:4], 6, nstart = 50)
distance_5 = 0
for (i in 1:nrow(iris_data)){
distance_5 = distance_5 + dist(rbind(iris_data[i,1:4], iris_mod_5$centers[iris_mod_5$cluster[i],]))
}
distance_5[1]
## [1] 22.04799
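As a side note, the five blocks above could be collapsed into a single loop; a sketch is below, under the assumption that we keep the same hand-computed sum of point-to-center distances. (kmeans also reports tot.withinss, the total within-cluster sum of squared distances, which is a common shortcut for the same comparison.)
#Hedged sketch: same computation as above, written as one loop over the number of centers
total_distances = rep(0, 5)
for (k in 2:6){
mod = kmeans(iris_data[,1:4], k, nstart = 50)
for (i in 1:nrow(iris_data)){
total_distances[k-1] = total_distances[k-1] + dist(rbind(iris_data[i,1:4], mod$centers[mod$cluster[i],]))
}
}
total_distances # cumulative distances for k = 2 through 6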
Now that we have computed the cumulative distances for models with 2, 3, 4, 5, and 6 centers, we can create our elbow graph to get a better picture of how many centers we should use.
#Now, let us make our "elbow graph" to get a visualization of the total distance from the cluster centers
distances = c(distance_1[1],distance_2[1] ,distance_3[1],distance_4[1],distance_5[1])
centers = c(2,3,4,5,6)
table_data = data.frame(centers = centers, distances = distances)
table_data
plot(table_data$centers, table_data$distances,
type = "b", # both points and lines
xlab = "Number of Centers (k)",
ylab = "Standardized Distance",
main = "Elbow Plot")
This graph gives us a good look at the drop-off in total distance as the number of centers increases. Now, I want to examine the models with 3, 4, and 5 centers more closely.
#Let us now examine the relationship between the sepals and the petals to see how good our clusters look in each scenario.
#At this point, 2 centers is not really feasible: there are three different species in the data, so 2 doesn't make sense. Thus, let us examine the relationship between Sepal and Petal Lengths/Widths for 3, 4, and 5 centers.
# 3 Centers
table(iris_mod_2$cluster, iris_data$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 47 14
## 3 0 3 36
ggplot(iris_data, aes(Petal.Length, Petal.Width, color = iris_mod_2$cluster)) + geom_point()
ggplot(iris_data, aes(Sepal.Length, Sepal.Width, color = iris_mod_2$cluster)) + geom_point()
# 4 Centers
table(iris_mod_3$cluster, iris_data$Species)
##
## setosa versicolor virginica
## 1 0 27 2
## 2 0 0 29
## 3 0 23 19
## 4 50 0 0
ggplot(iris_data, aes(Petal.Length, Petal.Width, color = iris_mod_3$cluster)) + geom_point()
ggplot(iris_data, aes(Sepal.Length, Sepal.Width, color = iris_mod_3$cluster)) + geom_point()
# 5 Centers
table(iris_mod_4$cluster, iris_data$Species)
##
## setosa versicolor virginica
## 1 0 0 29
## 2 0 27 2
## 3 0 23 19
## 4 28 0 0
## 5 22 0 0
ggplot(iris_data, aes(Petal.Length, Petal.Width, color = iris_mod_4$cluster)) + geom_point()
ggplot(iris_data, aes(Sepal.Length, Sepal.Width, color = iris_mod_4$cluster)) + geom_point()
After evaluating these options, I think our best model is the one with three centers. It looks the best graphically, and its table of cluster assignments versus species makes the most sense, in my opinion. Intuitively, this also makes sense because there are three species to categorize.
#Thus, here is our final table of classifications
table(iris_mod_2$cluster, iris_data$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 47 14
## 3 0 3 36
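To quantify how well this three-cluster solution predicts flower type, we can match each cluster to its majority species and count the agreements: from the table above that is 50 + 47 + 36 = 133 of 150 points, or roughly 88.7%. A minimal sketch of that calculation, using the objects already defined above, is:
#Match each cluster to its majority species and compute the overall agreement rate
conf = table(iris_mod_2$cluster, iris_data$Species)
sum(apply(conf, 1, max)) / nrow(iris_data) # about 0.887 (133 of 150 points)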
For this homework assignment, we used ChatGPT to help navigate the ggplot2 R package. We also used the R documentation for the kmeans algorithm and the rbind function.