Looking at our plot of weekly we don’t see any significant linear relationships. We’re given a bunch of quantitative variables with only one factor variable being Direction. Looking specfically at the direction variable that over the time of this data we have more instances of up, rather than down, but they figures are still fairly close. Next I wanted to look at the correlaion of the variables, and we really see that most have a negative relationship except for a few of the lags. The only real take away from this is that we see a high correlation between volume and year. Because of this correlation I wanted to get a quick graph of the two variables.Looking at the graph we see a steady increase in volume until about year 2000 where the increase becomes very large, and only grows from there.
plot(Weekly)
#View(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
table(Weekly$Direction)/sum(table(Weekly$Direction))
##
## Down Up
## 0.4444444 0.5555556
plot(Weekly$Direction)
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
ggplot(Weekly, aes(x = Year, y = Volume))+
geom_line(size = 1)+
theme_gray()
After running the logestic regression model we see that the only significant variable is Lag2, assuming a level significance of 5%.
week.glm = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = 'binomial')
summary(week.glm)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Looking at the confusion matrix we see that our accurracy for prediction is pretty low, 56%. Looking closer We see that we have a large number of false positives.
pred = predict.glm(week.glm, type = 'response')
pred_direction = ifelse(pred >= 0.5, "Up", "Down")
caret::confusionMatrix(as.factor(Weekly$Direction), as.factor(pred_direction))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 54 430
## Up 48 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.9063
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.52941
## Specificity : 0.56434
## Pos Pred Value : 0.11157
## Neg Pred Value : 0.92066
## Prevalence : 0.09366
## Detection Rate : 0.04959
## Detection Prevalence : 0.44444
## Balanced Accuracy : 0.54687
##
## 'Positive' Class : Down
##
Fitting it with the data set we see that we now get a prediction rate of 62.5%. We still have a higher false positive rate, but its much lower than before.
week_train = Weekly[Weekly$Year <= 2008, ]
week_test = Weekly[Weekly$Year > 2008, ]
week.glm.train = glm(Direction ~ Lag2, data = week_train, family = 'binomial')
#summary(week.glm.train)
pred2 = predict.glm(week.glm.train, newdata = week_test, type = 'response')
pred_direction2 = ifelse(pred2 >= 0.5, "Up", "Down")
caret::confusionMatrix(as.factor(week_test$Direction), as.factor(pred_direction2))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 34
## Up 5 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.8654
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.64286
## Specificity : 0.62222
## Pos Pred Value : 0.20930
## Neg Pred Value : 0.91803
## Prevalence : 0.13462
## Detection Rate : 0.08654
## Detection Prevalence : 0.41346
## Balanced Accuracy : 0.63254
##
## 'Positive' Class : Down
##
We get an accuracy of 62.5% from lda.
week.lda = lda(Direction ~ Lag2, data = week_train)
week.lda
## Call:
## lda(Direction ~ Lag2, data = week_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
pred.lda = predict(week.lda, week_test)
confusionMatrix(data = pred.lda$class, reference = week_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
Accuracy for QDA is 58%, but it should be noted that it predicts Up every time
week.qda = qda(Direction ~ Lag2, data = week_train)
week.qda
## Call:
## qda(Direction ~ Lag2, data = week_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
pred.qda = predict(week.qda, week_test)
confusionMatrix(data = pred.qda$class, reference = week_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
Using KNN we get an accuracy score of 50%
set.seed(1)
pred.knn = knn(data.frame(week_train$Lag2),
data.frame(week_test$Lag2),
cl = week_train$Direction,
k = 1,
prob = T)
confusionMatrix(pred.knn, week_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 30
## Up 22 31
##
## Accuracy : 0.5
## 95% CI : (0.4003, 0.5997)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9700
##
## Kappa : -0.0033
##
## Mcnemar's Test P-Value : 0.3317
##
## Sensitivity : 0.4884
## Specificity : 0.5082
## Pos Pred Value : 0.4118
## Neg Pred Value : 0.5849
## Prevalence : 0.4135
## Detection Rate : 0.2019
## Detection Prevalence : 0.4904
## Balanced Accuracy : 0.4983
##
## 'Positive' Class : Down
##
Looking at the accuracies we see that LDA and logistic regression both are the highest at 62.5%
First i’ll just the logistic regression using the interaction between Lag2 and Lag1 becuase they were one of the few lags that were correlated. After running the confusion matrix we return an accuracy rating of only 55.6%
week.glm2 = glm(Direction ~ Lag2:Lag1, data = Weekly, family = 'binomial')
summary(week.glm2)
##
## Call:
## glm(formula = Direction ~ Lag2:Lag1, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.362 -1.274 1.073 1.084 1.302
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.225716 0.061080 3.695 0.00022 ***
## Lag2:Lag1 0.005982 0.006285 0.952 0.34113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1495.3 on 1087 degrees of freedom
## AIC: 1499.3
##
## Number of Fisher Scoring iterations: 4
pred2 = predict.glm(week.glm2, type = 'response')
pred_direction2 = ifelse(pred2 >= 0.5, "Up", "Down")
caret::confusionMatrix(as.factor(Weekly$Direction), as.factor(pred_direction2))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 4 480
## Up 3 602
##
## Accuracy : 0.5565
## 95% CI : (0.5264, 0.5863)
## No Information Rate : 0.9936
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0037
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.571429
## Specificity : 0.556377
## Pos Pred Value : 0.008264
## Neg Pred Value : 0.995041
## Prevalence : 0.006428
## Detection Rate : 0.003673
## Detection Prevalence : 0.444444
## Balanced Accuracy : 0.563903
##
## 'Positive' Class : Down
##
Continuing with the same interaction terms we compute the confusion matrix for lda and return a score of 57.69%.
week.lda2 = lda(Direction ~ Lag2:Lag1, data = week_train)
week.lda2
## Call:
## lda(Direction ~ Lag2:Lag1, data = week_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2:Lag1
## Down -0.8014495
## Up -0.1393632
##
## Coefficients of linear discriminants:
## LD1
## Lag2:Lag1 0.1013404
pred.lda2 = predict(week.lda2, week_test)
confusionMatrix(data = pred.lda2$class, reference = week_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 1
## Up 43 60
##
## Accuracy : 0.5769
## 95% CI : (0.4761, 0.6732)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6193
##
## Kappa : -0.0192
##
## Mcnemar's Test P-Value : 6.37e-10
##
## Sensitivity : 0.000000
## Specificity : 0.983607
## Pos Pred Value : 0.000000
## Neg Pred Value : 0.582524
## Prevalence : 0.413462
## Detection Rate : 0.000000
## Detection Prevalence : 0.009615
## Balanced Accuracy : 0.491803
##
## 'Positive' Class : Down
##
For QDA I decided to experiment with a different interaction, that being between Lag2 and Lag3, which after running the confusion matrix gave us an score of 56.7%. We primarily made this change as the interaction between Lag2 and Lag1 gave us a very low score with QDA.
week.qda2 = qda(Direction ~ Lag2:Lag3, data = week_train)
week.qda2
## Call:
## qda(Direction ~ Lag2:Lag3, data = week_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2:Lag3
## Down -0.1937158
## Up -0.6405132
pred.qda2 = predict(week.qda2, week_test)
confusionMatrix(data = pred.qda2$class, reference = week_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 6 8
## Up 37 53
##
## Accuracy : 0.5673
## 95% CI : (0.4665, 0.6641)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6921
##
## Kappa : 0.0093
##
## Mcnemar's Test P-Value : 2.993e-05
##
## Sensitivity : 0.13953
## Specificity : 0.86885
## Pos Pred Value : 0.42857
## Neg Pred Value : 0.58889
## Prevalence : 0.41346
## Detection Rate : 0.05769
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.50419
##
## 'Positive' Class : Down
##
For KNN we ended setting k = 10 as it was able to slightly improve our score over 5. Still though, its accuracy was only raised a few points to 54.81%
set.seed(1)
pred.knn5 = knn(data.frame(week_train$Lag2),
data.frame(week_test$Lag2),
cl = week_train$Direction,
k = 10,
prob = T)
confusionMatrix(pred.knn5, week_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 17 21
## Up 26 40
##
## Accuracy : 0.5481
## 95% CI : (0.4474, 0.6459)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.8152
##
## Kappa : 0.052
##
## Mcnemar's Test P-Value : 0.5596
##
## Sensitivity : 0.3953
## Specificity : 0.6557
## Pos Pred Value : 0.4474
## Neg Pred Value : 0.6061
## Prevalence : 0.4135
## Detection Rate : 0.1635
## Detection Prevalence : 0.3654
## Balanced Accuracy : 0.5255
##
## 'Positive' Class : Down
##
Overall, even with some experimentation we weren’t able to get better than our initial 62.5% accuracy score produced from the GLM and LDA.
median(Auto$mpg)
## [1] 22.75
mpg01 = as.factor(ifelse(Auto$mpg >= 22.75, 1, 0))
str(mpg01)
## Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
table(mpg01)
## mpg01
## 0 1
## 196 196
Auto$origin = factor(Auto$origin, labels = c("American", "European", "Japanese"))
auto = data.frame(Auto, mpg01)
head(auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 American
## 2 15 8 350 165 3693 11.5 70 American
## 3 18 8 318 150 3436 11.0 70 American
## 4 16 8 304 150 3433 12.0 70 American
## 5 17 8 302 140 3449 10.5 70 American
## 6 15 8 429 198 4341 10.0 70 American
## name mpg01
## 1 chevrolet chevelle malibu 0
## 2 buick skylark 320 0
## 3 plymouth satellite 0
## 4 amc rebel sst 0
## 5 ford torino 0
## 6 ford galaxie 500 0
plot(auto)
boxplot(auto$horsepower ~ auto$mpg01, auto)
boxplot(auto$cylinders ~ auto$mpg01, auto)
boxplot(auto$weight ~ auto$mpg01, auto)
boxplot(auto$year ~ auto$mpg01, auto)
set.seed(1)
sub = sample(nrow(auto), nrow(auto) * .6)
auto.train = auto[sub, ]
auto.test = auto[-sub, ]
After running LDA on the variables most associated with mpg01 we returned a test error rate of 13%
auto.train.lda = lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = auto.train)
auto.test.pred = predict(auto.train.lda, auto.test)
auto.test.lda.class = auto.test.pred$class
#table(auto.test$mpg01, auto.test.lda.class)
mean(auto.test.lda.class!=auto.test$mpg01)
## [1] 0.133758
Running QDA we return a test error rate of 12%
auto.train.qda = qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = auto.train)
auto.test.pred.qda = predict(auto.train.qda, auto.test)
auto.test.qda.class = auto.test.pred.qda$class
mean(auto.test.qda.class!=auto.test$mpg01)
## [1] 0.1210191
We get the same error rate as we did with QDA for logistic regression, 12%.
auto.train.glm = glm(mpg01 ~ cylinders + displacement + horsepower + weight,
data = auto.train, family = binomial)
auto.test.pred.glm = predict(auto.train.glm, auto.test, type = 'response')
auto.test.glm.class = ifelse(auto.test.pred.glm > .5, 1, 0)
mean(auto.test.glm.class!=auto.test$mpg01)
## [1] 0.1210191
Running KNN we can look at k = 1:20 where we see that k = 9 preforms the best. We do end up computing a very high error rate of 50%.
set.seed(1)
knn.mpg = train(mpg01 ~ cylinders + displacement + weight + horsepower, data = auto.train,
method = 'knn', tuneGrid = expand.grid(k=seq(1,20,1)))
knn.mpg
## k-Nearest Neighbors
##
## 235 samples
## 4 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 235, 235, 235, 235, 235, 235, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.8587632 0.7167391
## 2 0.8611897 0.7221149
## 3 0.8627032 0.7252875
## 4 0.8727974 0.7449759
## 5 0.8718777 0.7429520
## 6 0.8727477 0.7446311
## 7 0.8794954 0.7582691
## 8 0.8784786 0.7562528
## 9 0.8807195 0.7606923
## 10 0.8800241 0.7593469
## 11 0.8772008 0.7535948
## 12 0.8740404 0.7475105
## 13 0.8753979 0.7501298
## 14 0.8720116 0.7434766
## 15 0.8726215 0.7447116
## 16 0.8767539 0.7527069
## 17 0.8733906 0.7459947
## 18 0.8733731 0.7460016
## 19 0.8716154 0.7424155
## 20 0.8715201 0.7422480
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
pred.knn = predict(knn.mpg, data = auto.test, type = 'raw')
mean(pred.knn != auto.test$mpg01)
## Warning in `!=.default`(pred.knn, auto.test$mpg01): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## [1] 0.5021277
median(Boston$crim)
## [1] 0.25651
Boston$crim = factor(ifelse(Boston$crim > .25651, 1, 0))
Running glm we get an accuracy of 91.5%
boston.glm = glm(crim ~ ., Boston, family = binomial)
summary(boston.glm)
##
## Call:
## glm(formula = crim ~ ., family = binomial, data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3946 -0.1585 -0.0004 0.0023 3.4239
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.103704 6.530014 -5.223 1.76e-07 ***
## zn -0.079918 0.033731 -2.369 0.01782 *
## indus -0.059389 0.043722 -1.358 0.17436
## chas 0.785327 0.728930 1.077 0.28132
## nox 48.523782 7.396497 6.560 5.37e-11 ***
## rm -0.425596 0.701104 -0.607 0.54383
## age 0.022172 0.012221 1.814 0.06963 .
## dis 0.691400 0.218308 3.167 0.00154 **
## rad 0.656465 0.152452 4.306 1.66e-05 ***
## tax -0.006412 0.002689 -2.385 0.01709 *
## ptratio 0.368716 0.122136 3.019 0.00254 **
## black -0.013524 0.006536 -2.069 0.03853 *
## lstat 0.043862 0.048981 0.895 0.37052
## medv 0.167130 0.066940 2.497 0.01254 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 701.46 on 505 degrees of freedom
## Residual deviance: 211.93 on 492 degrees of freedom
## AIC: 239.93
##
## Number of Fisher Scoring iterations: 9
pred.glm.bos = predict.glm(boston.glm, type = 'response')
pred.bos = ifelse(pred.glm.bos >= 0.5, 1, 0)
caret::confusionMatrix(as.factor(Boston$crim), as.factor(pred.bos))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 234 19
## 1 24 229
##
## Accuracy : 0.915
## 95% CI : (0.8872, 0.9378)
## No Information Rate : 0.5099
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.83
##
## Mcnemar's Test P-Value : 0.5419
##
## Sensitivity : 0.9070
## Specificity : 0.9234
## Pos Pred Value : 0.9249
## Neg Pred Value : 0.9051
## Prevalence : 0.5099
## Detection Rate : 0.4625
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9152
##
## 'Positive' Class : 0
##
For LDA we return a accuracy score of 85.5%
boston.lda = lda(crim ~ ., Boston)
pred.lda.boston = predict(boston.lda, Boston)
confusionMatrix(data = pred.lda.boston$class, reference = Boston$crim)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 240 60
## 1 13 193
##
## Accuracy : 0.8557
## 95% CI : (0.8221, 0.8852)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7115
##
## Mcnemar's Test P-Value : 7.289e-08
##
## Sensitivity : 0.9486
## Specificity : 0.7628
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.9369
## Prevalence : 0.5000
## Detection Rate : 0.4743
## Detection Prevalence : 0.5929
## Balanced Accuracy : 0.8557
##
## 'Positive' Class : 0
##
Running the model with QDA we return an accuracy of 90.5%
boston.qda = qda(crim ~ ., Boston)
pred.qda.boston = predict(boston.qda, Boston)
confusionMatrix(data = pred.qda.boston$class, reference = Boston$crim)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 252 47
## 1 1 206
##
## Accuracy : 0.9051
## 95% CI : (0.8762, 0.9292)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8103
##
## Mcnemar's Test P-Value : 8.293e-11
##
## Sensitivity : 0.9960
## Specificity : 0.8142
## Pos Pred Value : 0.8428
## Neg Pred Value : 0.9952
## Prevalence : 0.5000
## Detection Rate : 0.4980
## Detection Prevalence : 0.5909
## Balanced Accuracy : 0.9051
##
## 'Positive' Class : 0
##
Using KNN we return an accuracy score of 91.6% when k= 1.
set.seed(1)
boston.knn = train(crim ~ ., Boston, method = 'knn', tuneGrid = expand.grid(k=seq(1,20,1)))
boston.knn
## k-Nearest Neighbors
##
## 506 samples
## 13 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 506, 506, 506, 506, 506, 506, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9167006 0.8331366
## 2 0.9091468 0.8179958
## 3 0.9020065 0.8036794
## 4 0.9008807 0.8014285
## 5 0.9056188 0.8108326
## 6 0.9042593 0.8081688
## 7 0.9016401 0.8028562
## 8 0.8981840 0.7960143
## 9 0.8934816 0.7865572
## 10 0.8909119 0.7814292
## 11 0.8883550 0.7762695
## 12 0.8864353 0.7724074
## 13 0.8829908 0.7655497
## 14 0.8821341 0.7638773
## 15 0.8806973 0.7610447
## 16 0.8767248 0.7531522
## 17 0.8714638 0.7426165
## 18 0.8713321 0.7422955
## 19 0.8682418 0.7361468
## 20 0.8655964 0.7308633
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
Overall it became a close decision between GLM and KNN, but KNN slightly wins out with the better accuracy score.