library(ISLR)
library(MASS)
data("Weekly")
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
pairs(Weekly)
cor(Weekly[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
attach(Weekly)
plot(Volume)
A couple of things stood out while exploring the dataset. In the scatterplot matrix, I could not see any distinguishable correlation between individual predictors, other than between Year and Volume. The correlation matrix agrees: aside from Volume and Year, no pair of variables has a correlation beyond a magnitude of 0.1. Looking a little closer at Volume, it appears that trading volume has increased over time.
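As a quick visual check on that last point (a minimal sketch using the already-attached Weekly columns), a smoothed scatterplot of Volume against Year makes the upward trend easier to see:

# Smoothed scatterplot of trading volume over time
scatter.smooth(Year, Volume, xlab = "Year", ylab = "Volume")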
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The only statistically significant predictor appears to be Lag2.
probs = predict(glm.fit, type = "response")
pred.glm = rep("Down", length(probs))
pred.glm[probs > 0.5] <- "Up"
table(pred.glm, Direction)
## Direction
## pred.glm Down Up
## Down 54 48
## Up 430 557
# Compute the fraction of weeks for which the prediction was correct
mean(pred.glm == Direction)
## [1] 0.5610652
# Compute misclassification rate
478/1089
## [1] 0.4389348
# Calculating the percent of times upward movement was predicted correctly
557/(557+48)
## [1] 0.9206612
# Calculating the percent of times downward movement was predicted correctly
54/(54+430)
## [1] 0.1115702
Looking at the confusion matrix, the model predicted the direction of the market correctly about 56% of the time, for an error rate of about 43.9%. It predicted upward movement correctly about 92.1% of the time, but predicted downward movement correctly only about 11% of the time. Upward and downward movement here refer to weeks when the stock market goes Up or Down.
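Those percentages were typed in by hand from the confusion matrix; as a sketch, the same quantities can be pulled straight out of the table object (cm is a name I'm introducing here):

cm = table(pred.glm, Direction)
# Overall accuracy: correct predictions sit on the diagonal
sum(diag(cm)) / sum(cm)
# Fraction of actual Up weeks predicted correctly
cm["Up", "Up"] / sum(cm[, "Up"])
# Fraction of actual Down weeks predicted correctly
cm["Down", "Down"] / sum(cm[, "Down"])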
train = Weekly[Weekly$Year <= 2008, ]
test = Weekly[Weekly$Year > 2008, ]
testdirect = Weekly$Direction[Weekly$Year>2008]
glm.fit2 <- glm(Direction ~ Lag2, data = train, family = "binomial")
summary(glm.fit2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
prob = predict(glm.fit2, type="response", newdata = test)
pred2 = rep("Down", length(prob))
pred2[prob>0.5] = "Up"
table(pred2, testdirect)
## testdirect
## pred2 Down Up
## Down 9 5
## Up 34 56
# Compute the fraction of weeks for which the prediction was correct
mean(pred2 == testdirect)
## [1] 0.625
# Compute misclassification rate
39/104
## [1] 0.375
# Calculating the percent of times upward movement was predicted correctly
56/61
## [1] 0.9180328
# Calculating the percent of times downward movement was predicted correctly
9/43
## [1] 0.2093023
This time the percentage of correct predictions on the test data is 62.5%, which makes the test error rate about 37.5%. Compared to the confusion matrix generated from fitting a logistic regression model on the full data set (glm.fit), this model (glm.fit2) correctly predicted a higher percentage of outcomes. For weeks when the market goes up, the model is right 91.8% of the time; for weeks when the market goes down, it is right 20.9% of the time.
lda_fit = lda(Direction~Lag2, data = train)
lda_fit
## Call:
## lda(Direction ~ Lag2, data = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda_pred = predict(lda_fit, newdata = test)
names(lda_pred)
## [1] "class" "posterior" "x"
ldaclass = lda_pred$class
table(ldaclass, testdirect)
## testdirect
## ldaclass Down Up
## Down 9 5
## Up 34 56
The output is identical to the logistic regression results above. The percentage of correct predictions on the test data is 62.5%, for a test error rate of about 37.5%. For weeks when the market goes up, the model is right 91.8% of the time; for weeks when the market goes down, it is right 20.9% of the time.
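That agreement is not surprising: with a single predictor, logistic regression and LDA produce very similar decision rules here. A quick sketch to see it, reusing prob and lda_pred from above:

# Compare the two models' test-set probabilities of Up
plot(prob, lda_pred$posterior[, "Up"], xlab = "Logistic P(Up)", ylab = "LDA posterior P(Up)")
abline(0, 1, lty = 2)  # points hugging this line mean the models agree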
qda_fit = qda(Direction~Lag2, data = train)
qda_fit
## Call:
## qda(Direction ~ Lag2, data = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda_pred = predict(qda_fit, newdata = test)
names(qda_pred)
## [1] "class" "posterior"
qdaclass = qda_pred$class
table(qdaclass, testdirect)
## testdirect
## qdaclass Down Up
## Down 0 0
## Up 43 61
# Compute the fraction of weeks for which the prediction was correct
61/104
## [1] 0.5865385
# Compute misclassification rate
43/104
## [1] 0.4134615
This time the output is NOT the same as the last two. The percentage of correct predictions on the test data is 58.7%, for a test error rate of about 41.3%. QDA predicts Up for every test week, so it is right 100% of the time when the market goes up and 0% of the time when it goes down.
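A quick sanity check on why QDA never predicts Down (a sketch reusing qda_pred): if every posterior probability of Up exceeds 0.5, the all-Up predictions follow directly.

# Distribution of the QDA posterior probability of Up on the test set
summary(qda_pred$posterior[, "Up"])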
library(class)
# Pass the Direction factor directly as the class argument so knn keeps the Down/Up labels
training1 = cbind(train$Lag2)
testing1 = cbind(test$Lag2)
set.seed(1)
pred_K = knn(training1, testing1, train$Direction, k=1)
table(pred_K, testdirect)
## testdirect
## pred_K Down Up
## Down 21 30
## Up 22 31
The percentage of correct predictions on the test data is 50%, which makes the test error rate about 50%. For weeks when the market goes up, the model is right 50.8% of the time; for weeks when the market goes down, it is right 48.8% of the time.
LDA and logistic regression seemed to provide the best results, as they had the lowest misclassification rates and the highest rates of correct classification.
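To keep that comparison in one place, the test error rates can also be collected programmatically instead of read off one confusion matrix at a time (a sketch; err_rates is a name I'm introducing):

# Test error rate for each method, from its stored predictions
err_rates = c(logistic = mean(pred2 != testdirect),
              lda = mean(ldaclass != testdirect),
              qda = mean(qdaclass != testdirect),
              knn1 = mean(pred_K != testdirect))
sort(err_rates)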
Logistic regression experiment
# Adding an interaction term to the logistic regression model
INTXNglm.fit = glm(Direction ~ Lag1 + Lag3*Lag4, data = train, family = binomial)
summary(INTXNglm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag3 * Lag4, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.524 -1.251 1.011 1.090 1.559
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2284079 0.0649830 3.515 0.00044 ***
## Lag1 -0.0604336 0.0291941 -2.070 0.03845 *
## Lag3 -0.0004513 0.0305567 -0.015 0.98822
## Lag4 -0.0236111 0.0294339 -0.802 0.42245
## Lag3:Lag4 0.0124882 0.0081777 1.527 0.12674
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1347.3 on 980 degrees of freedom
## AIC: 1357.3
##
## Number of Fisher Scoring iterations: 4
#getting probabilities and table
probb = predict(INTXNglm.fit, type="response", newdata = test)
predd = rep("Down", length(probb))
predd[probb>0.5] = "Up"
table(predd, testdirect)
## testdirect
## predd Down Up
## Down 7 6
## Up 36 55
#error rate
42/104
## [1] 0.4038462
LDA experiment
lda_fit2 = lda(Direction~Lag1 + Lag3*Lag4, data = train)
lda_pred2 = predict(lda_fit2, newdata = test)
ldaclass2 = lda_pred2$class
table(ldaclass2, testdirect)
## testdirect
## ldaclass2 Down Up
## Down 6 6
## Up 37 55
#error rate
43/104
## [1] 0.4134615
QDA experiment
qda_fit2 = qda(Direction~Lag1 + Lag3*Lag4, data = train)
qda_pred2 = predict(qda_fit2, newdata = test)
qdaclass2 = qda_pred2$class
table(qdaclass2, testdirect)
## testdirect
## qdaclass2 Down Up
## Down 24 31
## Up 19 30
#error rate
50/104
## [1] 0.4807692
KNN experiment: k = 50, k = 20, k = 4
# As before, pass the Direction factor directly so knn keeps the Down/Up labels
training3 = cbind(train$Lag4)
testing2 = cbind(test$Lag4)
set.seed(1)
pred_K50 = knn(training3, testing2, train$Direction, k=50)
table(pred_K50, testdirect)
## testdirect
## pred_K50 Down Up
## Down 18 16
## Up 25 45
# Error rate, k = 50
41/104
## [1] 0.3942308
set.seed(1)
pred_K20 = knn(training3, testing2, train$Direction, k=20)
table(pred_K20, testdirect)
## testdirect
## pred_K20 Down Up
## Down 19 19
## Up 24 42
#Error Rate k = 20
43/104
## [1] 0.4134615
set.seed(1)
pred_K4 = knn(training3, testing2, train$Direction, k=4)
table(pred_K4, testdirect)
## testdirect
## pred_K4 Down Up
## Down 20 22
## Up 23 39
#Error Rate k = 4
45/104
## [1] 0.4326923
After experimenting with a couple of different models and methods, the lowest test error rate came from KNN on Lag4 with k = 50, at about 39.4%. The logistic regression model on Lag1 + Lag3 * Lag4 was a close second, with an error rate of about 40.3%.
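Rather than rerunning knn by hand for each k, a small sapply sweep can scan a grid of k values on the Lag4 predictor (a sketch; ks and knn_err are names I'm introducing):

# Test error rate for a grid of k values
ks = c(1, 4, 20, 50, 100)
set.seed(1)
knn_err = sapply(ks, function(k) mean(knn(training3, testing2, train$Direction, k = k) != testdirect))
names(knn_err) = ks
knn_err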
attach(Auto)
mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
pairs(Auto)
cor(Auto[-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
par(mfrow=c(2,2))
boxplot(cylinders ~ mpg01, data = Auto, main = "cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower vs mpg01")
From the scatterplot matrix and the correlation matrix, I noticed a few variables whose correlations with mpg01 were of high enough magnitude to be worth a closer look: cylinders, displacement, and horsepower.
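As a numeric complement to the boxplots (a quick sketch), each variable can be ranked by the absolute value of its correlation with mpg01:

# Absolute correlations with mpg01, strongest first (mpg01 and mpg themselves top the list)
sort(abs(cor(Auto[-9])[, "mpg01"]), decreasing = TRUE)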
train <- (year %% 2 == 0)
Autotrain <- Auto[train, ]
Autotest <- Auto[!train, ]
mpg01test <- mpg01[!train]
fit.lda = lda(mpg01 ~ cylinders + displacement + horsepower, data = Autotrain)
pred.lda = predict(fit.lda, Autotest)
table(pred.lda$class, mpg01test)
## mpg01test
## 0 1
## 0 88 10
## 1 12 72
#Error rate
22/182
## [1] 0.1208791
The test error rate appears to be about 12.1% in this case.
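The same number can also be computed directly, without counting off-diagonal cells by hand:

# Test error rate: fraction of misclassified test observations
mean(pred.lda$class != mpg01test)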
fit.qda = qda(mpg01 ~ cylinders + displacement + horsepower, data = Autotrain)
pred.qda = predict(fit.qda, Autotest)
table(pred.qda$class, mpg01test)
## mpg01test
## 0 1
## 0 88 11
## 1 12 71
#Error rate
23/182
## [1] 0.1263736
The test error rate appears to be about 12.6% in this case.
fit.glm = glm(mpg01 ~ cylinders + displacement + horsepower, data = Autotrain, family = binomial)
autoprobs = predict(fit.glm, Autotest, type = "response")
autopreds = rep(0, length(autoprobs))
autopreds[autoprobs > 0.5] = 1
table(autopreds, mpg01test)
## mpg01test
## autopreds 0 1
## 0 88 9
## 1 12 73
#error rate
21/182
## [1] 0.1153846
The test error rate appears to be about 11.5% in this case.
train.X = cbind(cylinders, displacement, horsepower)[train, ]
test.X = cbind(cylinders, displacement, horsepower)[!train, ]
train.mpg01 = mpg01[train]
set.seed(1)
pred_knn1 = knn(train.X, test.X, train.mpg01, k = 1)
table(pred_knn1, mpg01test)
## mpg01test
## pred_knn1 0 1
## 0 90 12
## 1 10 70
#Error Rate k = 1
22/182
## [1] 0.1208791
set.seed(1)
pred_knn5 = knn(train.X, test.X, train.mpg01, k = 5)
table(pred_knn5, mpg01test)
## mpg01test
## pred_knn5 0 1
## 0 88 13
## 1 12 69
# Error rate, k = 5
25/182
## [1] 0.1373626
set.seed(1)
pred_knn15 = knn(train.X, test.X, train.mpg01, k = 15)
table(pred_knn15, mpg01test)
## mpg01test
## pred_knn15 0 1
## 0 84 5
## 1 16 77
#Error Rate k = 15
21/182
## [1] 0.1153846
set.seed(1)
pred_knn40 = knn(train.X, test.X, train.mpg01, k = 40)
table(pred_knn40, mpg01test)
## mpg01test
## pred_knn40 0 1
## 0 84 4
## 1 16 78
#Error Rate k = 40
20/182
## [1] 0.1098901
The test error rates are given for four different values of k. For k = 1, the error rate was about 12.1%. For k = 5, it was about 13.7%. For k = 15, it was about 11.5%. For k = 40 we get the smallest error rate, about 11%, so KNN with k = 40 performed the best overall.
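One caveat worth noting: KNN is distance-based, so a predictor on a large scale like displacement dominates the distance calculation. A hedged sketch of the same k = 40 run with standardized predictors, using training-set means and SDs (train.Xs and test.Xs are names I'm introducing):

# Standardize with training means/SDs, applying the same transform to the test set
mu = colMeans(train.X)
sdev = apply(train.X, 2, sd)
train.Xs = scale(train.X, center = mu, scale = sdev)
test.Xs = scale(test.X, center = mu, scale = sdev)
set.seed(1)
mean(knn(train.Xs, test.Xs, train.mpg01, k = 40) != mpg01test)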
First, I'd like to see which variables appear significant in a logistic regression model with all the predictors, aside from crim and my newly created variable crime1. I'll then build a confusion matrix to keep and compare against the rest of my confusion matrices. After that, I'll repeat the process using only the variables from the full model that had significant p-values.
attach(Boston)
crime1 = rep(0, length(crim))
crime1[crim > median(crim)] <- 1
Bdf = data.frame(Boston, crime1)
train = 1:(length(crim) / 2)
test = (length(crim) / 2 + 1):length(crim)
Btrain = Bdf[train, ]
Btest = Bdf[test, ]
crim01test = crime1[test]
crime.fglm = glm(crime1~. - crime1 - crim, data = Bdf, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(crime.fglm)
##
## Call:
## glm(formula = crime1 ~ . - crime1 - crim, family = binomial,
## data = Bdf, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.83229 -0.06593 0.00000 0.06181 2.61513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -91.319906 19.490273 -4.685 2.79e-06 ***
## zn -0.815573 0.193373 -4.218 2.47e-05 ***
## indus 0.354172 0.173862 2.037 0.04164 *
## chas 0.167396 0.991922 0.169 0.86599
## nox 93.706326 21.202008 4.420 9.88e-06 ***
## rm -4.719108 1.788765 -2.638 0.00833 **
## age 0.048634 0.024199 2.010 0.04446 *
## dis 4.301493 0.979996 4.389 1.14e-05 ***
## rad 3.039983 0.719592 4.225 2.39e-05 ***
## tax -0.006546 0.007855 -0.833 0.40461
## ptratio 1.430877 0.359572 3.979 6.91e-05 ***
## black -0.017552 0.006734 -2.606 0.00915 **
## lstat 0.190439 0.086722 2.196 0.02809 *
## medv 0.598533 0.185514 3.226 0.00125 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 329.367 on 252 degrees of freedom
## Residual deviance: 69.568 on 239 degrees of freedom
## AIC: 97.568
##
## Number of Fisher Scoring iterations: 10
#error rate for full glm model
Bprobs <- predict(crime.fglm, Btest, type = "response")
Bpreds <- rep(0, length(Bprobs))
Bpreds[Bprobs > 0.5] <- 1
table(Bpreds, crim01test)
## crim01test
## Bpreds 0 1
## 0 68 24
## 1 22 139
#error rate
46/253
## [1] 0.1818182
The first logistic regression model gives a test error rate of about 18.2%.
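Equivalently, straight from the prediction vector rather than hand-counted cells:

# Test error rate for the full logistic model
mean(Bpreds != crim01test)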
Next, I create a logistic regression model using only the significant terms, then generate another confusion matrix to get its error rate.
crime.fglm2 = glm(crime1~. - crime1 - crim - chas - tax, data = Bdf, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Bprobs2 <- predict(crime.fglm2, Btest, type = "response")
Bpreds2 <- rep(0, length(Bprobs2))
Bpreds2[Bprobs2 > 0.5] <- 1
table(Bpreds2, crim01test)
## crim01test
## Bpreds2 0 1
## 0 67 24
## 1 23 139
#error rate
47/253
## [1] 0.1857708
This logistic regression model, containing only the significant predictors from crime.fglm, gives an error rate of about 18.6%: not much different from the first model, but slightly worse.
I'm wondering if LDA will produce a model with a lower error rate when applied to the same set of predictors as crime.fglm2. Let's see how it does.
crimefit.lda = lda(crime1 ~ . - crime1 - crim - chas - tax, data = Bdf, subset = train)
crimepred.lda = predict(crimefit.lda, Btest)
table(crimepred.lda$class, crim01test)
## crim01test
## 0 1
## 0 80 21
## 1 10 142
#error rate
31/253
## [1] 0.1225296
The LDA method gives a lower error rate than the previous two runs, about 12.3%.
Next, I'll see how the KNN method does at predicting with the same set of variables used in crime.fglm2 and crimefit.lda, experimenting with a few different values of K to look for a low error rate.
crimetrain = cbind(zn, indus, nox, rm, age, dis, rad, ptratio, black, lstat, medv)[train, ]
crimetest = cbind(zn, indus, nox, rm, age, dis, rad, ptratio, black, lstat, medv)[test, ]
traincrime = crime1[train]
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 1)
table(Cpredk, traincrime)
## traincrime
## Cpredk 0 1
## 0 116 49
## 1 47 41
#error rate
96/253
## [1] 0.3794466
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 2)
table(Cpredk, traincrime)
## traincrime
## Cpredk 0 1
## 0 101 41
## 1 62 49
#error rate
103/253
## [1] 0.4071146
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 4)
table(Cpredk, traincrime)
## traincrime
## Cpredk 0 1
## 0 99 34
## 1 64 56
#error rate
98/253
## [1] 0.3873518
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 8)
table(Cpredk, traincrime)
## traincrime
## Cpredk 0 1
## 0 102 32
## 1 61 58
#error rate
93/253
## [1] 0.3675889
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 16)
table(Cpredk, traincrime)
## traincrime
## Cpredk 0 1
## 0 99 34
## 1 64 56
#error rate
98/253
## [1] 0.3873518
I ended up using K = 1, 2, 4, 8, and 16. The error rates vary between about 36.8% (k = 8) and about 40.7% (k = 2). Compared to my other methods, the error rates from KNN are much higher.
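Two caveats on these KNN numbers. First, the tables above tabulate the test-set predictions against the training labels (traincrime) rather than the held-out labels (crim01test), so the rates computed from them should be treated with care. Second, KNN is sensitive to predictor scale: nox sits below 1 while black runs into the hundreds, so unscaled distances are dominated by a few variables. A hedged sketch that standardizes the predictors and scores against the test labels (Cpredk_s is a name I'm introducing):

# Standardize with training means/SDs, then evaluate against the held-out labels
mu = colMeans(crimetrain)
sdev = apply(crimetrain, 2, sd)
set.seed(1)
Cpredk_s = knn(scale(crimetrain, mu, sdev), scale(crimetest, mu, sdev), traincrime, k = 8)
mean(Cpredk_s != crim01test)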
Across the experiments above on the Boston data set, the best method for predicting whether a given suburb has a crime rate above or below the median was LDA, with an error rate of about 12.3%.