Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
In my current job, the Information Technologies department maintains a list of requested projects in a wait state. Currently it is a table in an Excel spreadsheet with columns for Project Name, Description, Status, Start and Due Date, Sponsor, etc. In addition, we have several columns that could serve as predictors if we give them numeric weights, such as Category (Survival, Growth, Maintenance), Timing (Near term, Short term, Long term), Risk, Potential ROI, Cost, and others. I think it would be a neat exercise to use a classification model to decide which projects should be started and which should be kept waiting, giving us a starting order; we might also factor in precedence or dependency. If we do so, we will transform the spreadsheet from a project list into a project portfolio.
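For illustration only (the projects, scores, and weights below are invented, not our actual data), the waiting-list columns could be encoded numerically along these lines:
# hypothetical numeric encoding of the project waiting list (invented values)
projects <- data.frame(
  Category = c(3, 2, 1),        # Survival = 3, Growth = 2, Maintenance = 1
  Timing   = c(1, 2, 3),        # Near term = 1, Short term = 2, Long term = 3
  Risk     = c(0.2, 0.5, 0.8),  # estimated risk score (0 = low, 1 = high)
  ROI      = c(1.5, 3.0, 0.8),  # potential return on investment
  Cost     = c(50, 120, 20),    # estimated cost
  Start    = c(1, 1, 0)         # response: 1 = start now, 0 = keep waiting
)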
The files credit_card_data.txt (without headers) and credit_card_data-headers.txt (with headers) contain a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It has anonymized credit card applications with a binary response variable (last column) indicating if the application was positive or negative. The dataset is the "Credit Approval Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) without the categorical variables and without data points that have missing values.
- Using the support vector machine function ksvm contained in the R package kernlab, find a good classifier for this data. Show the equation of your classifier, and how well it classifies the data points in the full data set. (Don't worry about test/validation data yet; we'll cover that topic soon.) Notes on ksvm:
- You can use scaled=TRUE to get ksvm to scale the data as part of calculating a classifier.
- The term λ we used in the SVM lesson to trade off the two components of correctness and margin is called C in ksvm. One of the challenges of this homework is to find a value of C that works well; for many values of C, almost all predictions will be "yes" or almost all predictions will be "no".
- ksvm does not directly return the coefficients a0 and a1…am. Instead, you need to do the last step of the calculation yourself.
filename = "E:/mzambrano/Documents/OneDrive/Personaldocs/Backtoschool/GeorgiaTech/MicroMasters/GTx_ISYE6501x_IntroductionToAnalyticsModeling/02 - HW1/Data2.2/credit_card_data-headers.txt"
# Get the data from file
ccdata <- as.matrix(read.delim(filename, header=TRUE))
# load library kernlab
library(kernlab)
## Warning: package 'kernlab' was built under R version 3.4.4
skernel = "vanilladot"
cvalue <- 100
# this only works if ccdata is a matrix
# we use the first 10 columns as data and the 11th column as response vector
# as per the question we don't worry about splitting into test/validation, use all the data
model <- ksvm(ccdata[,1:10],ccdata[,11],type="C-svc",kernel=skernel,C=cvalue,scaled=TRUE)
## Setting default kernel parameters
# Calculate a1...am
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
# Calculate a0
a0 <- -model@b
# model predictions (model vs data)
pred <- predict(model,ccdata[,1:10])
# see what fraction of the model's predictions match the actual classification
matching <- sum(pred == ccdata[,11]) / nrow(ccdata)
a
## A1 A2 A3 A8 A9
## -0.0010065348 -0.0011729048 -0.0016261967 0.0030064203 1.0049405641
## A10 A11 A12 A14 A15
## -0.0028259432 0.0002600295 -0.0005349551 -0.0012283758 0.1063633995
a0
## [1] 0.08158492
matching
## [1] 0.8639144
The equation of the classifier is that of a soft (linear) classifier: a point is classified according to the sign of a0 + a1*x1 + ... + a10*x10, where the ten coefficients a1...a10 and the intercept a0 are the values shown above and the xi are the scaled predictors A1 through A15. With C = 100 this classifier matches 86.39% of the responses in the full data set.
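As a quick check of the equation (a minimal sketch; it assumes ksvm's internal scaling is the usual center-by-mean, divide-by-standard-deviation that R's scale() performs):
# evaluate the classifier equation by hand and compare with the 86.39% above
scaledX <- scale(ccdata[,1:10])                  # scale the ten predictors
f <- scaledX %*% a + a0                          # decision values a0 + sum(ai * xi)
manualPred <- ifelse(f > 0, 1, 0)                # classify by the sign of f
sum(manualPred == ccdata[,11]) / nrow(ccdata)    # fraction matching the actual labels
If the fraction comes out near 14% instead of 86%, kernlab has simply mapped the two classes to the opposite signs, and the rule becomes f < 0 for class 1.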
What is left is to evaluate values of C (the lambda from the lesson) across several orders of magnitude.
# Get the data from file
filename = "E:/mzambrano/Documents/OneDrive/Personaldocs/Backtoschool/GeorgiaTech/MicroMasters/GTx_ISYE6501x_IntroductionToAnalyticsModeling/02 - HW1/Data2.2/credit_card_data-headers.txt"
ccdata <- as.matrix(read.delim(filename, header=TRUE))
# load library kernlab
library(kernlab)
skernel = "vanilladot"
# vector containing c values (lambdas)
cvalues <- c(.00001,.0001,.001,.01,.1, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000,100000000)
# vector containing labels for showing in the plot
cxlabels <- c("1e-05","1e-04","1e-03","1e-02","1e-01","1e+00","1e+01","1e+02","1e+03","1e+04","1e+05","1e+06","1e+07","1e+08")
# Data frame to store results
results <- data.frame("kernelname" = character(), "value of C" = numeric(), matching = numeric(), stringsAsFactors = FALSE)
for (cvalue in cvalues)
{
# Fit ksvm for each of the 14 values of C (lambda), from 1e-05 to 1e+08
model <- ksvm(ccdata[,1:10],ccdata[,11],type="C-svc",kernel=skernel,C=cvalue,scaled=TRUE)
# model predictions (model vs data)
pred <- predict(model,ccdata[,1:10])
# see what fraction of the model's predictions match the actual classification
matching <- sum(pred == ccdata[,11]) / nrow(ccdata)
# add to results dataframe
results[nrow(results) + 1,] <- c(skernel,cvalue,matching)
}
## Setting default kernel parameters (this message is printed once for each of the 14 fits)
results
## kernelname value.of.C matching
## 1 vanilladot 1e-05 0.547400611620795
## 2 vanilladot 1e-04 0.547400611620795
## 3 vanilladot 0.001 0.837920489296636
## 4 vanilladot 0.01 0.863914373088685
## 5 vanilladot 0.1 0.863914373088685
## 6 vanilladot 1 0.863914373088685
## 7 vanilladot 10 0.863914373088685
## 8 vanilladot 100 0.863914373088685
## 9 vanilladot 1000 0.862385321100917
## 10 vanilladot 10000 0.862385321100917
## 11 vanilladot 1e+05 0.863914373088685
## 12 vanilladot 1e+06 0.625382262996942
## 13 vanilladot 1e+07 0.545871559633027
## 14 vanilladot 1e+08 0.663608562691132
# let's visualize the results
plot(results[results[,1]==skernel,3], type = "o", col = "red", xlab = "Values of C (lambda)", ylab = "match %",
main = "Kernels: matching % vs c values", axes = FALSE)
text(1:14, .57, cxlabels,cex=0.6)
text(1:14, .99, cxlabels,cex=0.6)
axis(side=1, at=c(1:14), cex.axis = 0.6, labels=cxlabels)
axis(side=2, at=seq(0, 1.1, by=0.025), cex.axis=0.6)
legend("center",legend=skernel, col="red", lty=1:2, cex=0.8)
grid()
We can see that the vanilladot kernel achieves a maximum match rate of 86.39% against the actual classification, and that this level is maintained for C between roughly 1e-02 and 1e+05; outside that range the accuracy drops off quickly.
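To pull the best C out of the results table programmatically (a small sketch; as.numeric() is needed because adding rows with c() stored every column as character):
# row of results with the highest match fraction for the vanilladot kernel
results[which.max(as.numeric(results$matching)), ]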
- You are welcome, but not required, to try other (nonlinear) kernels as well; we're not covering them in this course, but they can sometimes be useful and might provide better predictions than vanilladot.
This is the most general version: we try four kernels and 14 different values of C. The results are then plotted for easy identification.
# Get the data from file
filename = "E:/mzambrano/Documents/OneDrive/Personaldocs/Backtoschool/GeorgiaTech/MicroMasters/GTx_ISYE6501x_IntroductionToAnalyticsModeling/02 - HW1/Data2.2/credit_card_data-headers.txt"
ccdata <- as.matrix(read.delim(filename, header=TRUE))
# load library kernlab
library(kernlab)
# Auxiliary lists and variables
#---------------------------------------------------------------------------
# vector containing kernel name
kernels <- c("rbfdot","splinedot","vanilladot","polydot")
# vector with colors to apply in plot
colors <- c("red","blue","green","black")
# vector containing c values (lambdas)
cvalues <- c(.00001,.0001,.001,.01,.1, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000,100000000)
# vector containing labels for showing in the plot
cxlabels <- c("1e-05","1e-04","1e-03","1e-02","1e-01","1e+00","1e+01","1e+02","1e+03","1e+04","1e+05","1e+06","1e+07","1e+08")
# Data frame to store results
results <- data.frame("kernelname" = character(), "value of C" = numeric(), matching = numeric(), stringsAsFactors = FALSE)
n <- 0
#---------------------------------------------------------------------------
# main loop, get each kernel in vector
for (skernel in kernels)
{
# counter
n <- n + 1
# loop lambda values
for (cvalue in cvalues)
{
# this only works if ccdata is a matrix
model <- ksvm(ccdata[,1:10],ccdata[,11],type="C-svc",kernel=skernel,C=cvalue,scaled=TRUE)
# if ccdata is a dataframe, use as.matrix and as.factor
# model <- ksvm(as.matrix(ccdata[,1:10]),as.factor(ccdata[,11]),type="C-svc",kernel="vanilladot",C=1000,scaled=TRUE)
# Calculate a1...am and a0 (computed here for reference; only the match fraction is stored)
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a0 <- -model@b
# model predictions (model vs data)
pred <- predict(model,ccdata[,1:10])
# see what fraction of the model's predictions match the actual classification
matching <- sum(pred == ccdata[,11]) / nrow(ccdata)
# add to results
results[nrow(results) + 1,] <- c(skernel,cvalue,matching)
#print(skernel)
}
if (n == 1)
{
plot(results[results[,1]==skernel,3], type = "o", col = colors[n], xlab = "Values of C", ylab = "match %",
main = "Kernels: matching % vs c values", axes = FALSE)
# text(1:14, .57, cxlabels,cex=0.6)
text(1:14, .99, cxlabels,cex=0.6)
axis(side=1, at=c(1:14), cex.axis = 0.6, labels=cxlabels)
axis(side=2, at=seq(0, 1.1, by=0.025), cex.axis=0.6)
}
else
{
lines(results[results[,1]==skernel,3], type = "o", col = colors[n])
}
}
## Setting default kernel parameters (this message is repeated 42 times, once per fit across the kernel/C combinations)
legend("center",legend=kernels, col=colors, lty=1:2, cex=0.8)
grid()
results
## kernelname value.of.C matching
## 1 rbfdot 1e-05 0.547400611620795
## 2 rbfdot 1e-04 0.547400611620795
## 3 rbfdot 0.001 0.547400611620795
## 4 rbfdot 0.01 0.567278287461774
## 5 rbfdot 0.1 0.859327217125382
## 6 rbfdot 1 0.871559633027523
## 7 rbfdot 10 0.90519877675841
## 8 rbfdot 100 0.954128440366973
## 9 rbfdot 1000 0.984709480122324
## 10 rbfdot 10000 0.995412844036697
## 11 rbfdot 1e+05 0.996941896024465
## 12 rbfdot 1e+06 0.998470948012232
## 13 rbfdot 1e+07 1
## 14 rbfdot 1e+08 1
## 15 splinedot 1e-05 0.577981651376147
## 16 splinedot 1e-04 0.623853211009174
## 17 splinedot 0.001 0.782874617737003
## 18 splinedot 0.01 0.81039755351682
## 19 splinedot 0.1 0.944954128440367
## 20 splinedot 1 0.966360856269113
## 21 splinedot 10 0.978593272171254
## 22 splinedot 100 0.978593272171254
## 23 splinedot 1000 0.978593272171254
## 24 splinedot 10000 0.978593272171254
## 25 splinedot 1e+05 0.978593272171254
## 26 splinedot 1e+06 0.943425076452599
## 27 splinedot 1e+07 0.877675840978593
## 28 splinedot 1e+08 0.865443425076453
## 29 vanilladot 1e-05 0.547400611620795
## 30 vanilladot 1e-04 0.547400611620795
## 31 vanilladot 0.001 0.837920489296636
## 32 vanilladot 0.01 0.863914373088685
## 33 vanilladot 0.1 0.863914373088685
## 34 vanilladot 1 0.863914373088685
## 35 vanilladot 10 0.863914373088685
## 36 vanilladot 100 0.863914373088685
## 37 vanilladot 1000 0.862385321100917
## 38 vanilladot 10000 0.862385321100917
## 39 vanilladot 1e+05 0.863914373088685
## 40 vanilladot 1e+06 0.625382262996942
## 41 vanilladot 1e+07 0.545871559633027
## 42 vanilladot 1e+08 0.663608562691132
## 43 polydot 1e-05 0.547400611620795
## 44 polydot 1e-04 0.547400611620795
## 45 polydot 0.001 0.837920489296636
## 46 polydot 0.01 0.863914373088685
## 47 polydot 0.1 0.863914373088685
## 48 polydot 1 0.863914373088685
## 49 polydot 10 0.863914373088685
## 50 polydot 100 0.863914373088685
## 51 polydot 1000 0.862385321100917
## 52 polydot 10000 0.862385321100917
## 53 polydot 1e+05 0.862385321100917
## 54 polydot 1e+06 0.331804281345566
## 55 polydot 1e+07 0.767584097859327
## 56 polydot 1e+08 0.67737003058104
From these results we can see that rbfdot and splinedot achieve very high match rates against the actual classification (close to 100%), better than vanilladot: rbfdot stays above 99% from C = 1e+04 through 1e+08 (the largest value tried), and splinedot stays above 97% from C = 10 through 1e+05.
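Keep in mind these match rates are measured on the same data the models were fitted to, so the flexible kernels may simply be memorizing it. As a quick sanity check, anticipating the test/validation topic the assignment defers, a sketch like the following could hold out a random third of the rows (the seed, split size, and C value are arbitrary choices of mine):
# hypothetical holdout check for rbfdot (illustrative only; not required by this question)
set.seed(42)                                         # make the random split reproducible
holdout <- sample(1:nrow(ccdata), round(nrow(ccdata)/3))
m <- ksvm(ccdata[-holdout,1:10], ccdata[-holdout,11], type="C-svc",
          kernel="rbfdot", C=1e+06, scaled=TRUE)
# fraction of held-out points classified correctly
sum(predict(m, ccdata[holdout,1:10]) == ccdata[holdout,11]) / length(holdout)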
- Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don't forget to scale the data (scale=TRUE in kknn).
First try:
# Read the data from file
filename = "E:/mzambrano/Documents/OneDrive/Personaldocs/Backtoschool/GeorgiaTech/MicroMasters/GTx_ISYE6501x_IntroductionToAnalyticsModeling/02 - HW1/Data2.2/credit_card_data-headers.txt"
ccdata <- read.delim(filename, header=TRUE)
# Load library
library(kknn)
## Warning: package 'kknn' was built under R version 3.4.4
# we take a sample of about 1/3 of the points for learning and use the remaining 2/3 for testing
numSample <- sample(1:654, 210)
learningData <- ccdata[numSample, ]
testingData <- ccdata[-numSample, ]
# Train model with different values of K
# (note: R1 is numeric here, so train.kknn treats the response as continuous, i.e. regression)
model <- train.kknn(R1 ~ ., data = learningData, kmax = 9, scale=TRUE)
model
##
## Call:
## train.kknn(formula = R1 ~ ., data = learningData, kmax = 9, scale = TRUE)
##
## Type of response variable: continuous
## minimal mean absolute error: 0.2422777
## Minimal mean squared error: 0.1373601
## Best kernel: optimal
## Best k: 9
prediction <- predict(model, testingData[, -11])
prediction
## [1] 0.317254449 1.000000000 0.797797042 0.500065424 1.000000000
## [6] 0.559957878 0.667669619 0.202202958 0.510224074 0.544691990
## ... (output truncated: 444 predictions in total, most of them fractional values rather than 0/1 labels)
CM <- table(testingData[, 11], prediction)
CM
## prediction
## 0 0.00763859493855109 0.023947932599854 0.0421065107340216
## 0 57 9 5 12
## 1 1 0 0 0
## ... (table truncated: there is one column per distinct fractional prediction value)
# calculate accuracy
accuracy <- (sum(diag(CM)))/sum(CM)
accuracy
## [1] 0.1283784
# Plot model
plot(model)
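The response R1 was left numeric, so train.kknn fit a regression: the predictions above are fractions, the confusion matrix gets one column per distinct predicted value, and almost nothing lands on the diagonal, which is why the computed "accuracy" is only about 13%. A minimal fix-up sketch for this first try (assuming the model and testingData objects above are still in the workspace) is to threshold the fractional predictions at 0.5 before tabulating:
# threshold the fractional predictions at 0.5 to recover a 0/1 confusion matrix
predClass <- as.integer(predict(model, testingData[, -11]) >= 0.5)
CM2 <- table(actual = testingData[, 11], predicted = predClass)
sum(diag(CM2)) / sum(CM2)   # accuracy on 0/1 classes
The cleaner solution, used in the second try below, is to convert R1 to a factor so kknn does classification directly.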
Second try: converting R1 to a factor so that train.kknn performs classification instead of regression, and varying the value of kmax.
# Read the data from file
filename = "E:/mzambrano/Documents/OneDrive/Personaldocs/Backtoschool/GeorgiaTech/MicroMasters/GTx_ISYE6501x_IntroductionToAnalyticsModeling/02 - HW1/Data2.2/credit_card_data-headers.txt"
ccdata <- read.delim(filename, header=TRUE)
# Load library
library(kknn)
# some insight gained from
# https://stackoverflow.com/questions/57649227/how-to-predict-in-kknn-function-librarykknn
# we take a sample of about 1/3 of the points for learning and use the remaining 2/3 for testing
numSample <- sample(1:654, 210)
learningData <- ccdata[numSample, ]
testingData <- ccdata[-numSample, ]
# Auxiliary lists and variables
#---------------------------------------------------------------------------
# vector containing the values of kmax (maximum number of neighbors) considered, 4 to 30
numnneighbors <- seq(4,30)
# vector with colors to apply in plot, let's get some random colors
#colsample <- sample(1:657,30)
# colors <- colors()[colsample]
# Data frame to store results
results <- data.frame("K passed" = integer(), "best kernel" = character(), "best K" = integer(), accuracy = numeric(), stringsAsFactors = FALSE)
#---------------------------------------------------------------------------
n <- 0
# main loop, get each kernel in vector
for (kpar in numnneighbors)
{
n <- n + 1
# Train model with different values of K
model <- train.kknn(as.factor(R1) ~ ., data = learningData, kmax = kpar, scale=TRUE)
prediction <- predict(model, testingData[, -11])
CM <- table(testingData[, 11], prediction)
# calculate accuracy
accuracy <- (sum(diag(CM)))/sum(CM)
results[nrow(results) + 1,] <- c(kpar,model[["best.parameters"]][["kernel"]],model[["best.parameters"]][["k"]],accuracy)
}
plot(results[,3],results[,4], type='o',ylab = "Accuracy", xlab = "Best K",main = "Accuracy vs Best K",col = "red")
results
## K.passed best.kernel best.K accuracy
## 1 4 optimal 1 0.813063063063063
## 2 5 optimal 5 0.835585585585586
## 3 6 optimal 5 0.835585585585586
## 4 7 optimal 5 0.835585585585586
## 5 8 optimal 5 0.835585585585586
## 6 9 optimal 5 0.835585585585586
## 7 10 optimal 5 0.835585585585586
## 8 11 optimal 5 0.835585585585586
## 9 12 optimal 5 0.835585585585586
## 10 13 optimal 5 0.835585585585586
## 11 14 optimal 5 0.835585585585586
## 12 15 optimal 5 0.835585585585586
## 13 16 optimal 5 0.835585585585586
## 14 17 optimal 5 0.835585585585586
## 15 18 optimal 5 0.835585585585586
## 16 19 optimal 5 0.835585585585586
## 17 20 optimal 5 0.835585585585586
## 18 21 optimal 5 0.835585585585586
## 19 22 optimal 5 0.835585585585586
## 20 23 optimal 5 0.835585585585586
## 21 24 optimal 5 0.835585585585586
## 22 25 optimal 5 0.835585585585586
## 23 26 optimal 5 0.835585585585586
## 24 27 optimal 5 0.835585585585586
## 25 28 optimal 5 0.835585585585586
## 26 29 optimal 5 0.835585585585586
## 27 30 optimal 5 0.835585585585586
From these results we can determine a good value of K. In the run shown above the best K is 5, with an accuracy of about 84% on the testing portion; in other runs the best K has ranged from 9 to 13, with accuracy of 82% or better. However, the results vary noticeably between successive runs because the 210-point learning sample is drawn at random, so I am looking forward to the homework review and to seeing what others have done when assessing peers.
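Much of that run-to-run variation comes from the random learning sample: sample() is called without set.seed(), so every knit draws a different split. One way to get a steadier estimate (a sketch; the seed, kmax = 15, and the 20 repetitions are arbitrary choices of mine) is to average the test accuracy over several random splits:
# average test accuracy over 20 random learning/testing splits (illustrative sketch)
set.seed(1)
accs <- replicate(20, {
  idx  <- sample(1:654, 210)
  fit  <- train.kknn(as.factor(R1) ~ ., data = ccdata[idx, ], kmax = 15, scale = TRUE)
  pred <- predict(fit, ccdata[-idx, -11])
  mean(pred == ccdata[-idx, 11])          # accuracy on the held-out 2/3
})
mean(accs)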