Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
An example that could be used by a company is a classification model that predicts whether someone will click a link in a marketing email. The response variable would be binary, “Yes” (clicked) or “No” (did not click). Some useful predictors that would work across almost any company are age, sex (or gender), and the area code of the recipient’s phone number; a sketch of how such a model could be set up follows.
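To make this concrete, here is a minimal sketch of how such a classifier could be fit with ksvm. The email_clicks data frame, its columns, and the simulated values are all hypothetical and purely for illustration; in practice the data would come from the company’s email platform.
library(kernlab)

# hypothetical marketing data, simulated purely for illustration;
# real data would come from the company's email platform
set.seed(1)
email_clicks <- data.frame(
  age       = sample(18:75, 500, replace = TRUE),
  sex       = factor(sample(c("F", "M"), 500, replace = TRUE)),
  area_code = factor(sample(c("404", "470", "678"), 500, replace = TRUE)),
  clicked   = factor(sample(c("No", "Yes"), 500, replace = TRUE))
)

# the same kind of linear SVM classifier used for the credit-card data below
click_model <- ksvm(clicked ~ age + sex + area_code, data = email_clicks,
                    type = "C-svc", kernel = "vanilladot", C = 100)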
Using the support vector machine function ksvm contained in the R package kernlab, find a good classifier for this data. Show the equation of your classifier, and how well it classifies the data points in the full data set. (Don’t worry about test/validation data yet; we’ll cover that topic soon.)
First, let’s load all the packages we’re going to use for question 2.2.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(kknn)
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
cc_data_headers <- read_tsv("C:/Users/zlow3/Documents/Georgia Tech/ISYE-6501/HW 1/credit_card_data-headers.txt")
## Parsed with column specification:
## cols(
## A1 = col_double(),
## A2 = col_double(),
## A3 = col_double(),
## A8 = col_double(),
## A9 = col_double(),
## A10 = col_double(),
## A11 = col_double(),
## A12 = col_double(),
## A14 = col_double(),
## A15 = col_double(),
## R1 = col_double()
## )
cc_model <- ksvm(as.matrix(cc_data_headers[,1:10]), cc_data_headers[,11],
type = "C-svc", kernel = 'vanilladot', C = 100, scaled = TRUE)
## Setting default kernel parameters
# calculate a1...am from the C = 100 model above
a <- colSums(cc_model@xmatrix[[1]] * cc_model@coef[[1]])
# calculate a0
a0 <- -cc_model@b

# refit the model for each value of C to see whether the accuracy changes
for (i in seq(100, 1000, by = 100)) {
  model_c <- ksvm(as.matrix(cc_data_headers[,1:10]), cc_data_headers[,11],
                  type = "C-svc", kernel = "vanilladot", C = i, scaled = TRUE)
  # see what the model predicts
  pred <- predict(model_c, cc_data_headers[,1:10])
  # see what fraction of the model's predictions match the actual classification
  accuracy <- sum(pred == cc_data_headers[,11]) / nrow(cc_data_headers)
  print(paste0("C = ", i, " yields a ", accuracy, " accuracy."))
}
## [1] "C = 100 yields a 0.863914373088685 accuracy."
## [1] "C = 200 yields a 0.863914373088685 accuracy."
## [1] "C = 300 yields a 0.863914373088685 accuracy."
## [1] "C = 400 yields a 0.863914373088685 accuracy."
## [1] "C = 500 yields a 0.863914373088685 accuracy."
## [1] "C = 600 yields a 0.863914373088685 accuracy."
## [1] "C = 700 yields a 0.863914373088685 accuracy."
## [1] "C = 800 yields a 0.863914373088685 accuracy."
## [1] "C = 900 yields a 0.863914373088685 accuracy."
## [1] "C = 1000 yields a 0.863914373088685 accuracy."
As we can see, our coefficients and intercept are:
print(a)
## A1 A2 A3 A8 A9
## -0.0010065348 -0.0011729048 -0.0016261967 0.0030064203 1.0049405641
## A10 A11 A12 A14 A15
## -0.0028259432 0.0002600295 -0.0005349551 -0.0012283758 0.1063633995
print(a0)
## [1] 0.08158492
This gives the equation of our classifier: -0.0010*A1 - 0.0012*A2 - 0.0016*A3 + 0.0030*A8 + 1.0049*A9 - 0.0028*A10 + 0.0003*A11 - 0.0005*A12 - 0.0012*A14 + 0.1064*A15 + 0.0816 = 0 (on the scaled predictors, since scaled = TRUE). As the for loop above shows, changing the value of C from 100 to 1000 did nothing to change the accuracy of the model. I also went through and ran a separate loop to make sure that values of C from 0 to 100 didn’t change anything either (a reconstruction of that sweep is sketched below), so I would stick with the recommended C = 100 that the homework hints used in their formula.
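For completeness, a minimal reconstruction of that separate small-C sweep might look like the loop below. This is a sketch, not the exact code that was run; note that C should be strictly positive, so the sweep starts just above 0 rather than at 0.
# reconstruction of the small-C sweep described above (illustrative, not the original code)
for (c_small in c(0.01, 0.1, 1, 10, 50, 100)) {
  model_small <- ksvm(as.matrix(cc_data_headers[,1:10]), cc_data_headers[,11],
                      type = "C-svc", kernel = "vanilladot", C = c_small, scaled = TRUE)
  pred_small <- predict(model_small, cc_data_headers[,1:10])
  acc_small  <- sum(pred_small == cc_data_headers[,11]) / nrow(cc_data_headers)
  print(paste0("C = ", c_small, " yields a ", acc_small, " accuracy."))
}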
Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don’t forget to scale the data (scale=TRUE in kknn).
set.seed(12345)
pred.kknn <- rep(0, nrow(cc_data_headers))
for (j in 3:20) {
  # leave-one-out: predict each point using all of the other points
  for (i in 1:nrow(cc_data_headers)) {
    model_knn <- kknn(R1 ~ ., cc_data_headers[-i, ], cc_data_headers[i, ],
                      k = j, scale = TRUE)
    # round the fitted value to 0 or 1
    pred.kknn[i] <- as.integer(fitted(model_knn) + 0.5)
  }
  accuracy <- sum(pred.kknn == cc_data_headers[,11]) / nrow(cc_data_headers)
  print(paste0("Accuracy: ", accuracy, " for K = ", j))
}
## [1] "Accuracy: 0.814984709480122 for K = 3"
## [1] "Accuracy: 0.814984709480122 for K = 4"
## [1] "Accuracy: 0.851681957186544 for K = 5"
## [1] "Accuracy: 0.845565749235474 for K = 6"
## [1] "Accuracy: 0.847094801223242 for K = 7"
## [1] "Accuracy: 0.848623853211009 for K = 8"
## [1] "Accuracy: 0.847094801223242 for K = 9"
## [1] "Accuracy: 0.850152905198777 for K = 10"
## [1] "Accuracy: 0.851681957186544 for K = 11"
## [1] "Accuracy: 0.853211009174312 for K = 12"
## [1] "Accuracy: 0.851681957186544 for K = 13"
## [1] "Accuracy: 0.851681957186544 for K = 14"
## [1] "Accuracy: 0.853211009174312 for K = 15"
## [1] "Accuracy: 0.851681957186544 for K = 16"
## [1] "Accuracy: 0.851681957186544 for K = 17"
## [1] "Accuracy: 0.851681957186544 for K = 18"
## [1] "Accuracy: 0.850152905198777 for K = 19"
## [1] "Accuracy: 0.850152905198777 for K = 20"
As we can see from the output above, K = 5 appears to be a good choice, reaching about 85.2% accuracy. A few larger values of K (12 and 15) do marginally better (about 85.3%), but the difference is a single data point, and a smaller K keeps the model simpler and its predictions cheaper to compute. A sketch of an automated way to run this same search is shown below.
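As a side note, the kknn package also provides train.kknn(), which performs this same leave-one-out comparison internally. A minimal sketch, assuming the response is converted to a factor so it is treated as classification, and a rectangular kernel to mimic plain (unweighted) k-nearest-neighbors:
# sketch: let train.kknn() run the leave-one-out search over k
cc_factor <- cc_data_headers
cc_factor$R1 <- as.factor(cc_factor$R1)

loo_search <- train.kknn(R1 ~ ., data = cc_factor, kmax = 20,
                         kernel = "rectangular", scale = TRUE)
loo_search$best.parameters   # the k (and kernel) chosen by leave-one-out error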