Question 2.1
One everyday situation where a classification model could be applied is deciding whether an individual should buy or sell a stock based on a predicted close price. Predictors for this response could include the stock’s high, low, close, and volume over a range of time. Financial data from a source such as Yahoo Finance could be used to train, validate, and test the model.
Question 2.2
Libraries Needed
library(kernlab)
library(kknn)
library(rsample)
library(caret)
Importing the file “credit_card_data.txt” and naming it “ccdata”
ccdata <- read.table("credit_card_data.txt", header = FALSE, stringsAsFactors = FALSE)
set.seed(3)
head(ccdata)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 1 30.83 0.000 1.25 1 0 1 1 202 0 1
## 2 0 58.67 4.460 3.04 1 0 6 1 43 560 1
## 3 0 24.50 0.500 1.50 1 1 0 1 280 824 1
## 4 1 27.83 1.540 3.75 1 0 5 0 100 3 1
## 5 1 20.17 5.625 1.71 1 1 0 1 120 0 1
## 6 1 32.08 4.000 2.50 1 1 0 0 360 0 1
tail(ccdata)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 649 1 40.58 3.290 3.50 0 1 0 0 400 0 0
## 650 1 21.08 10.085 1.25 0 1 0 1 260 0 0
## 651 0 22.67 0.750 2.00 0 0 2 0 200 394 0
## 652 0 25.25 13.500 2.00 0 0 1 0 200 1 0
## 653 1 17.92 0.205 0.04 0 1 0 1 280 750 0
## 654 1 35.00 3.375 8.29 0 1 0 0 0 0 0
Iterating through different magnitudes of C and storing the accuracy of each model’s predictions in a vector called accuracy_vector.
Cloop <- 10^(-5:2)
accuracy_vector <- vector("numeric")
for(lambda in Cloop) {
model <- ksvm(as.matrix(ccdata[,1:10]),
as.factor(ccdata[,11]),
type = "C-svc",
kernel = "vanilladot",
C = lambda,
scaled=TRUE)
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a0 <- -model@b
pred <- predict(model,ccdata[,1:10])
accuracy <- sum(pred == ccdata[,11]) / nrow(ccdata)
accuracy_vector <- c(accuracy_vector,accuracy)
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
plot(accuracy_vector)
max(accuracy_vector)
## [1] 0.8639144
which.max(accuracy_vector)
## [1] 4
Calculating a1…am with C = 10^2 (the model left over from the last loop iteration):
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a
## V1 V2 V3 V4 V5
## -0.0010065348 -0.0011729048 -0.0016261967 0.0030064203 1.0049405641
## V6 V7 V8 V9 V10
## -0.0028259432 0.0002600295 -0.0005349551 -0.0012283758 0.1063633995
Calculating a0 with C = 10^2:
a0 <- -model@b
a0
## [1] 0.08158492
Model Prediction with C = 10^2:
pred <- predict(model,ccdata[,1:10])
pred
## [1] 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [297] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
## [556] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
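The accuracy vector shows that ~86.39% is both the maximum and the most common value. It is reached for every C from 10^-2 to 10^2, and which.max reports index 4 (C = 10^-2) only because it returns the first maximum. We will use C = 10^2. Accuracy only drops once C falls to 10^-3 or below.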
accuracy_vector
## [1] 0.5474006 0.5474006 0.8379205 0.8639144 0.8639144 0.8639144 0.8639144
## [8] 0.8639144
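Since which.max returns only an index, it can help to map it back to the actual C value and to list every C tied for the best accuracy. A minimal sketch in base R, with the accuracies copied from the printed output above (best_index, best_C, tied_C, and tol are names introduced here for illustration):

```r
# Candidate C values, as in the loop above
Cloop <- 10^(-5:2)

# Accuracies copied from the printed accuracy_vector
accuracy_vector <- c(0.5474006, 0.5474006, 0.8379205, 0.8639144,
                     0.8639144, 0.8639144, 0.8639144, 0.8639144)

# which.max() returns the index of the FIRST maximum only
best_index <- which.max(accuracy_vector)
best_C <- Cloop[best_index]

# All C values whose accuracy ties the maximum (within a small tolerance)
tol <- 1e-9
tied_C <- Cloop[abs(accuracy_vector - max(accuracy_vector)) < tol]

print(best_C)
print(tied_C)
```

Any of the tied values gives the same accuracy on this data, so the choice among them is a judgment call rather than something the accuracy vector decides.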
With a and a0, the equation for our classifier can be expressed as follows (because the model was fit with scaled=TRUE, the coefficients apply to the scaled predictors):
equation <- paste("0 =", a[1], "* V1 +", a[2], "* V2 +", a[3], "* V3 +", a[4], "* V4 +", a[5], "* V5 +", a[6], "* V6 +", a[7], "* V7 +", a[8], "* V8 +", a[9], "* V9 +", a[10], "* V10 +", a0)
equation
## [1] "0 = -0.00100653481057611 * V1 + -0.00117290480611665 * V2 + -0.00162619672236963 * V3 + 0.0030064202649194 * V4 + 1.00494056410556 * V5 + -0.00282594323043472 * V6 + 0.000260029507016313 * V7 + -0.000534955143494997 * V8 + -0.00122837582291523 * V9 + 0.106363399527188 * V10 + 0.081584921659538"
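To make the equation concrete, the sketch below evaluates a1*V1 + a2*V2 + (and so on through V10) + a0 for a single point and reports which side of the separating hyperplane it falls on. The coefficients are copied (rounded) from the output above. The two input points are hypothetical and are assumed to be already scaled the same way ksvm scaled the training data. Which class label corresponds to a positive or negative value is not asserted here and should be checked against predict().

```r
# Coefficients copied (rounded) from the printed output above
a <- c(V1 = -0.0010065348, V2 = -0.0011729048, V3 = -0.0016261967,
       V4 = 0.0030064203, V5 = 1.0049405641, V6 = -0.0028259432,
       V7 = 0.0002600295, V8 = -0.0005349551, V9 = -0.0012283758,
       V10 = 0.1063633995)
a0 <- 0.08158492

# Hypothetical helper: value of a . x + a0 for one scaled point
decision_value <- function(x_scaled, a, a0) {
  sum(a * x_scaled) + a0
}

# Two hypothetical scaled points that differ only in V5
x_pos <- c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
x_neg <- c(0, 0, 0, 0, -1, 0, 0, 0, 0, 0)

d_pos <- decision_value(x_pos, a, a0)
d_neg <- decision_value(x_neg, a, a0)

print(c(d_pos, d_neg))        # values of the left-hand side
print(sign(c(d_pos, d_neg)))  # side of the hyperplane
```

Because the V5 coefficient is far larger than the others, flipping V5 alone is enough to move these two points to opposite sides of the hyperplane.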
#KKNN loop that iterates through different values of K and stores the accuracy of each model in a vector so that we can determine which K value gives us the highest accuracy on the dataset.
knn_pred <- rep(0, nrow(ccdata)) #vector of 0's the size of our dataset, to be filled with the model's 1/0 predictions
knn_acc_vector <- vector("numeric") #empty vector to store the accuracy of our model for each value of K
for (K in 1:50) { #K values to iterate through
for (i in 1:nrow(ccdata)) { #for each data point i
knn_model <- kknn(ccdata[-i,11]~., #Can also be "V11 ~ ." if the training data includes column V11
ccdata[-i,1:10], #Train on all the predictors for all but the ith data point
ccdata[i,1:10], #Test on the predictors of the ith data point only
k = K,
kernel = "optimal",
scale = TRUE)
knn_pred[i] <- round(fitted(knn_model)) #fitted() returns the predicted response. Since kknn treats the response as continuous, round() turns each prediction into 1 or 0 before it is stored in knn_pred
}
knn_acc <- sum(knn_pred == ccdata[,11]) / nrow(ccdata) #fraction of data points where our prediction matches the data set
knn_acc_vector <- c(knn_acc_vector,knn_acc) #for each K, store the accuracy in a vector
}
plot(knn_acc_vector)
max(knn_acc_vector) # Accurate 85.32% of the time!
## [1] 0.853211
which.max(knn_acc_vector) #Max accuracy @ K = 12
## [1] 12
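The loop above relies on round() to turn a continuous kknn prediction into a 0/1 label. One detail worth knowing: R rounds exact halves to the even digit, so round(0.5) is 0 while round(1.5) is 2. If a fitted value of exactly 0.5 should count as class 1, an explicit threshold is unambiguous. A small sketch with made-up fitted values (to_label is a hypothetical helper, not part of kknn):

```r
# Made-up continuous predictions, for illustration only
fitted_values <- c(0.10, 0.49, 0.50, 0.51, 0.90)

# round() sends the exact half 0.5 to the even integer 0
labels_round <- round(fitted_values)

# Hypothetical helper: explicit threshold, so 0.5 counts as class 1
to_label <- function(p, threshold = 0.5) as.integer(p >= threshold)
labels_threshold <- to_label(fitted_values)

print(labels_round)
print(labels_threshold)
```

The two approaches only disagree on exact ties, so whether this matters depends on how often the fitted values land on 0.5.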
Question 3.1a
Using the full ccdata set, we can evaluate our model with k-fold cross-validation using the function cv.kknn from library(kknn). Keeping the number of folds constant, we can iterate over K, the number of nearest neighbors. We could also do the opposite: keep K constant and iterate over the number of folds to find the best one.
k_acc_vec = vector("numeric")
for (K in 1:50) {
kmodel3 <- cv.kknn(V11 ~ .,
ccdata,
kcv = 10, # # of folds
k = K,
kernel = "optimal",
scale = TRUE)
kmodel3 <- data.frame(kmodel3) #cv.kknn returns a list, so we use data.frame to turn it into a regular table of values
kmodelpred2 <- kmodel3[,2] #the 2nd column has our model predictions
rpred2 <- round(kmodelpred2) #round them so that they are 1 or 0
k_accuracy3 <- sum(rpred2 == ccdata[,11]) / nrow(ccdata)
k_acc_vec <- c(k_acc_vec, k_accuracy3)
}
plot(k_acc_vec)
max(k_acc_vec) # 85.78% accurate
## [1] 0.8577982
which.max(k_acc_vec) # Most accurate with a K value of 5
## [1] 5
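Training kknn with train.kknn, which selects k by leave-one-out cross-validation. Note that k_accuracy below is computed by predicting the same rows the model was trained on, so it is likely optimistic.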
set.seed(3)
kmodel <- train.kknn(V11 ~.,
ccdata,
kmax = 100,
kernel = "optimal",
scale = TRUE)
kpred <- predict(kmodel, ccdata)
roundedpred <- round(kpred)
k_accuracy <- sum(roundedpred == ccdata[,11])/ nrow(ccdata)
k_accuracy
## [1] 0.8776758
kmodel
##
## Call:
## train.kknn(formula = V11 ~ ., data = ccdata, kmax = 100, kernel = "optimal", scale = TRUE)
##
## Type of response variable: continuous
## minimal mean absolute error: 0.1850153
## Minimal mean squared error: 0.1073792
## Best kernel: optimal
## Best k: 58
Question 3.1b
Splitting the data into training, validation, and test data, we can compare between the KNN and SVM.
set.seed(3)
#Splitting data into roughly 70% training, 15% validation, and 15% testing (each row's group is drawn at random, so the proportions are approximate)
ccdatasplit <- sample(1:3, nrow(ccdata), prob = c(.7,.15,.15), replace = TRUE)
cctrain <- ccdata[ccdatasplit == 1,]
ccvalid <- ccdata[ccdatasplit == 2,]
cctest <- ccdata[ccdatasplit == 3,]
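Because sample(1:3, ..., prob = ...) assigns each row independently, the three groups only approximate 70/15/15. If fixed group sizes are preferred, one alternative (not what the code above does) is to shuffle the row indices once and cut them at fixed positions. A minimal sketch, assuming 654 rows as in ccdata; shuffled, train_idx, valid_idx, and test_idx are names introduced here for illustration:

```r
set.seed(3)
n <- 654  # number of rows, as in ccdata

# Shuffle the row indices once
shuffled <- sample(n)

# Fixed group sizes: 70% training, 15% validation, remainder for testing
n_train <- round(0.70 * n)
n_valid <- round(0.15 * n)

train_idx <- shuffled[1:n_train]
valid_idx <- shuffled[(n_train + 1):(n_train + n_valid)]
test_idx  <- shuffled[(n_train + n_valid + 1):n]

print(c(train = length(train_idx),
        valid = length(valid_idx),
        test  = length(test_idx)))
```

With these index vectors, the three sets would be built as ccdata[train_idx,], ccdata[valid_idx,], and ccdata[test_idx,]. Using this split instead would change every accuracy reported below.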
#Training the KSVM model with our previous code to find the C value that gives the highest accuracy on our training set.
Cloop <- 10^(-3:3)
ksvm_acc_vec <- vector("numeric")
for(lambda in Cloop) {
ksvm_model <- ksvm(as.matrix(cctrain[,1:10]),
as.factor(cctrain[,11]),
type = "C-svc",
kernel = "vanilladot",
C = lambda,
scaled=TRUE)
a <- colSums(ksvm_model@xmatrix[[1]] * ksvm_model@coef[[1]])
a0 <- -ksvm_model@b
prediction <- predict(ksvm_model,cctrain[,1:10])
ksvm_acc <- sum(prediction == cctrain[,11]) / nrow(cctrain)
ksvm_acc_vec <- c(ksvm_acc_vec,ksvm_acc)
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
ksvm_acc_vec
## [1] 0.8200000 0.8711111 0.8711111 0.8711111 0.8711111 0.8711111 0.8711111
Our ksvm_model has a consistent training accuracy of 87.11% for every C from 10^-2 to 10^3 and only drops (to 82.00%) at C = 10^-3, so we will use C = 100 for our KSVM model for validation.
set.seed(3)
ksvm_model2 <- ksvm(as.matrix(cctrain[,1:10]), #train on training set
as.factor(cctrain[,11]),
type = "C-svc",
kernel = "vanilladot",
C = 100,
scaled=TRUE)
## Setting default kernel parameters
ksvm_prediction_valid <- predict(ksvm_model2, ccvalid[,1:10]) #predicting how the model will do on our validation set's predictors.
ksvm.acc <- sum(ksvm_prediction_valid == ccvalid[,11]) / nrow(ccvalid)
ksvm.acc # 87.76% accurate!
## [1] 0.877551
I was skeptical about the predict function, so I wanted to see if I could reproduce this accuracy by training a model on the validation set itself.
#Validating KSVM Model
set.seed(3)
ksvm_model_valid <- ksvm(as.matrix(ccvalid[,1:10]),
as.factor(ccvalid[,11]),
type = "C-svc",
kernel = "vanilladot",
C = 100,
scaled=TRUE)
## Setting default kernel parameters
ksvm_model_valid
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 100
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 42
##
## Objective Function Value : -2400.483
## Training error : 0.122449
1-.122449
## [1] 0.877551
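Got the same number! The model fit on the validation data has a training error of about 0.122449, so it is accurate about 87.76% of the time on those rows. Note that this is a different model (fit to the validation rows) than ksvm_model2, so the match does not by itself verify predict. The validation accuracy that counts is the 87.76% from ksvm_model2 above, slightly higher than the 87.11% on our training set. Let's see how our KNN model performs on the training set.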
set.seed(3)
#Finding the best K on the training set by iterating through different values of K. The K with the highest accuracy will be used for our kknn model on the validation dataset.
knn_pred2 <- rep(0, nrow(cctrain))
knn_acc_vector2 <- vector("numeric")
for (K in 1:50) {
for (i in 1:nrow(cctrain)) {
knn_model2 <- kknn(cctrain[-i,11]~.,
cctrain[-i,1:10],
cctrain[i,1:10],
k = K,
kernel = "optimal",
scale = TRUE)
knn_pred2[i] <- round(fitted(knn_model2))
knn_acc2 <- sum(knn_pred2 == cctrain[,11]) / nrow(cctrain)
}
knn_acc_vector2 <- c(knn_acc_vector2,knn_acc2)
}
plot(knn_acc_vector2)
max(knn_acc_vector2) # 84.667% Accurate!
## [1] 0.8466667
which.max(knn_acc_vector2) #K value of 10 is the most accurate
## [1] 10
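K = 10 had the highest accuracy on the training set with an accuracy of 84.667%. Let's see how this performs on the validation set. Note that the loop below runs leave-one-out within the validation rows (each point is predicted from the other validation points) rather than predicting them from a model trained on cctrain.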
set.seed(3)
knn_pred3 <- rep(0, nrow(ccvalid))
knn_acc_vector3 <- vector("numeric")
for (i in 1:nrow(ccvalid)) {
knn_model3 <- kknn(ccvalid[-i,11]~.,
ccvalid[-i,1:10],
ccvalid[i,1:10],
k = 10,
kernel = "optimal",
scale = TRUE)
knn_pred3[i] <- round(predict(knn_model3))
knn.acc = sum(knn_pred3 == ccvalid[,11])/ nrow(ccvalid)
}
knn.acc # 85.71% accurate with the validation set
## [1] 0.8571429
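For comparison, a more conventional way to score KNN on a validation set is to keep the training rows as the neighbors and predict the validation rows from them in a single kknn call. The sketch below shows that pattern on synthetic data so that it runs without the credit card file. The names toy, toy_train, toy_valid, and valid_acc are made up for illustration, and the numbers it produces say nothing about ccdata.

```r
library(kknn)

set.seed(3)

# Synthetic two-predictor data with a 0/1 response (illustration only)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- as.integer(toy$x1 + toy$x2 > 0)

# Random split into training and validation rows
split <- sample(1:2, n, prob = c(0.7, 0.3), replace = TRUE)
toy_train <- toy[split == 1, ]
toy_valid <- toy[split == 2, ]

# Neighbors come from the training rows; predictions are for the validation rows
fit <- kknn(y ~ ., train = toy_train, test = toy_valid,
            k = 10, kernel = "optimal", scale = TRUE)

# The response is numeric, so the fitted values are continuous; threshold at 0.5
valid_pred <- as.integer(fitted(fit) >= 0.5)
valid_acc <- mean(valid_pred == toy_valid$y)
print(valid_acc)
```

On the real data the same pattern would be kknn(V11 ~ ., train = cctrain, test = ccvalid, k = 10, kernel = "optimal", scale = TRUE). That call was not run for this write-up, so no result is reported for it.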
The KKNN model performed slightly better on the validation set (85.71%) than on the training set (84.67%). Since the SVM model still had the higher validation accuracy (87.76%), we will use our SVM model on the test data set to see how well it predicts.
set.seed(3)
ksvm_prediction_test <- predict(ksvm_model2, cctest[,1:10])
ksvm.acc2 <- sum(ksvm_prediction_test == cctest[,11]) / nrow(cctest)
ksvm.acc2 # 82.08% accurate on the test set!
## [1] 0.8207547
Conclusion
The KSVM model is the better model for our data because of its higher accuracy on the validation set compared to KNN. On the test set, the KSVM model is accurate about 82.08% of the time, down from the 87.11% accuracy we observed on the training set.