Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
An example that could be used by a company is a classification model that predicts whether someone will click a link in a marketing email. The response variable would be binary, “Yes” (clicked) or “No” (did not click). Some useful predictors that would work across almost any company are age, sex (or gender), and the area code of the recipient’s phone number; a sketch of how such a model could be set up follows.
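To make this concrete, here is a minimal sketch of how such a classifier could be fit with ksvm. The email_clicks data frame, its columns, and the simulated values are all hypothetical and purely for illustration; in practice the data would come from the company’s email platform.
library(kernlab)

# hypothetical marketing data, simulated purely for illustration;
# real data would come from the company's email platform
set.seed(1)
email_clicks <- data.frame(
  age       = sample(18:75, 500, replace = TRUE),
  sex       = factor(sample(c("F", "M"), 500, replace = TRUE)),
  area_code = factor(sample(c("404", "470", "678"), 500, replace = TRUE)),
  clicked   = factor(sample(c("No", "Yes"), 500, replace = TRUE))
)

# the same kind of linear SVM classifier used for the credit-card data below
click_model <- ksvm(clicked ~ age + sex + area_code, data = email_clicks,
                    type = "C-svc", kernel = "vanilladot", C = 100)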
Using the support vector machine function ksvm contained in the R package kernlab, find a good classifier for this data. Show the equation of your classifier, and how well it classifies the data points in the full data set. (Don’t worry about test/validation data yet; we’ll cover that topic soon.)
First, let’s load all the packages we’re going to use for question 2.2.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(kknn)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
cc_data_headers <- read_tsv("C:/Users/zlow3/Documents/Georgia Tech/ISYE-6501/HW 1/credit_card_data-headers.txt")
## Parsed with column specification:
## cols(
## A1 = col_double(),
## A2 = col_double(),
## A3 = col_double(),
## A8 = col_double(),
## A9 = col_double(),
## A10 = col_double(),
## A11 = col_double(),
## A12 = col_double(),
## A14 = col_double(),
## A15 = col_double(),
## R1 = col_double()
## )
cc_model <- ksvm(as.matrix(cc_data_headers[,1:10]), cc_data_headers[,11],
type = "C-svc", kernel = 'vanilladot', C = 100, scaled = TRUE)
## Setting default kernel parameters
# calculate a1...am from the C = 100 model above
a <- colSums(cc_model@xmatrix[[1]] * cc_model@coef[[1]])
# calculate a0
a0 <- -cc_model@b

# refit the model for each value of C to see whether the accuracy changes
for (i in seq(100, 1000, by = 100)) {
  model_c <- ksvm(as.matrix(cc_data_headers[,1:10]), cc_data_headers[,11],
                  type = "C-svc", kernel = "vanilladot", C = i, scaled = TRUE)
  # see what the model predicts
  pred <- predict(model_c, cc_data_headers[,1:10])
  # see what fraction of the model's predictions match the actual classification
  accuracy <- sum(pred == cc_data_headers[,11]) / nrow(cc_data_headers)
  print(paste0("C = ", i, " yields a ", accuracy, " accuracy."))
}
## [1] "C = 100 yields a 0.863914373088685 accuracy."
## [1] "C = 200 yields a 0.863914373088685 accuracy."
## [1] "C = 300 yields a 0.863914373088685 accuracy."
## [1] "C = 400 yields a 0.863914373088685 accuracy."
## [1] "C = 500 yields a 0.863914373088685 accuracy."
## [1] "C = 600 yields a 0.863914373088685 accuracy."
## [1] "C = 700 yields a 0.863914373088685 accuracy."
## [1] "C = 800 yields a 0.863914373088685 accuracy."
## [1] "C = 900 yields a 0.863914373088685 accuracy."
## [1] "C = 1000 yields a 0.863914373088685 accuracy."
As we can see, our coefficients and intercept are:
print(a)
## A1 A2 A3 A8 A9
## -0.0010065348 -0.0011729048 -0.0016261967 0.0030064203 1.0049405641
## A10 A11 A12 A14 A15
## -0.0028259432 0.0002600295 -0.0005349551 -0.0012283758 0.1063633995
print(a0)
## [1] 0.08158492
This gives the equation of our classifier: -0.0010*A1 - 0.0012*A2 - 0.0016*A3 + 0.0030*A8 + 1.0049*A9 - 0.0028*A10 + 0.0003*A11 - 0.0005*A12 - 0.0012*A14 + 0.1064*A15 + 0.0816 = 0 (on the scaled predictors, since scaled = TRUE). As the for loop above shows, changing the value of C from 100 to 1000 did nothing to change the accuracy of the model. I also went through and ran a separate loop to make sure that values of C from 0 to 100 didn’t change anything either (a reconstruction of that sweep is sketched below), so I would stick with the recommended C = 100 that the homework hints used in their formula.
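For completeness, a minimal reconstruction of that separate small-C sweep might look like the loop below. This is a sketch, not the exact code that was run; note that C should be strictly positive, so the sweep starts just above 0 rather than at 0.
# reconstruction of the small-C sweep described above (illustrative, not the original code)
for (c_small in c(0.01, 0.1, 1, 10, 50, 100)) {
  model_small <- ksvm(as.matrix(cc_data_headers[,1:10]), cc_data_headers[,11],
                      type = "C-svc", kernel = "vanilladot", C = c_small, scaled = TRUE)
  pred_small <- predict(model_small, cc_data_headers[,1:10])
  acc_small  <- sum(pred_small == cc_data_headers[,11]) / nrow(cc_data_headers)
  print(paste0("C = ", c_small, " yields a ", acc_small, " accuracy."))
}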
Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don’t forget to scale the data (scale=TRUE in kknn).
set.seed(12345)
pred.kknn <- rep(0, nrow(cc_data_headers))
for (j in 3:20) {
  # leave-one-out: predict each point using all of the other points
  for (i in 1:nrow(cc_data_headers)) {
    model_knn <- kknn(R1 ~ ., cc_data_headers[-i, ], cc_data_headers[i, ],
                      k = j, scale = TRUE)
    # round the fitted value to 0 or 1
    pred.kknn[i] <- as.integer(fitted(model_knn) + 0.5)
  }
  accuracy <- sum(pred.kknn == cc_data_headers[,11]) / nrow(cc_data_headers)
  print(paste0("Accuracy: ", accuracy, " for K = ", j))
}
## [1] "Accuracy: 0.814984709480122 for K = 3"
## [1] "Accuracy: 0.814984709480122 for K = 4"
## [1] "Accuracy: 0.851681957186544 for K = 5"
## [1] "Accuracy: 0.845565749235474 for K = 6"
## [1] "Accuracy: 0.847094801223242 for K = 7"
## [1] "Accuracy: 0.848623853211009 for K = 8"
## [1] "Accuracy: 0.847094801223242 for K = 9"
## [1] "Accuracy: 0.850152905198777 for K = 10"
## [1] "Accuracy: 0.851681957186544 for K = 11"
## [1] "Accuracy: 0.853211009174312 for K = 12"
## [1] "Accuracy: 0.851681957186544 for K = 13"
## [1] "Accuracy: 0.851681957186544 for K = 14"
## [1] "Accuracy: 0.853211009174312 for K = 15"
## [1] "Accuracy: 0.851681957186544 for K = 16"
## [1] "Accuracy: 0.851681957186544 for K = 17"
## [1] "Accuracy: 0.851681957186544 for K = 18"
## [1] "Accuracy: 0.850152905198777 for K = 19"
## [1] "Accuracy: 0.850152905198777 for K = 20"
As we can see from the output above, K = 5 appears to be a good choice, reaching about 85.2% accuracy. A few larger values of K (12 and 15) do marginally better (about 85.3%), but the difference is a single data point, and a smaller K keeps the model simpler and its predictions cheaper to compute. A sketch of an automated way to run this same search is shown below.
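As a side note, the kknn package also provides train.kknn(), which performs this same leave-one-out comparison internally. A minimal sketch, assuming the response is converted to a factor so it is treated as classification, and a rectangular kernel to mimic plain (unweighted) k-nearest-neighbors:
# sketch: let train.kknn() run the leave-one-out search over k
cc_factor <- cc_data_headers
cc_factor$R1 <- as.factor(cc_factor$R1)

loo_search <- train.kknn(R1 ~ ., data = cc_factor, kmax = 20,
                         kernel = "rectangular", scale = TRUE)
loo_search$best.parameters   # the k (and kernel) chosen by leave-one-out error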