This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
library(ISLR2); library(corrplot); library(MASS); library(caret)
## corrplot 0.92 loaded
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
## Loading required package: ggplot2
## Loading required package: lattice
library(car); library(dplyr); library(class); library(e1071)
## Loading required package: carData
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pairs(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
# Looking at the relationships between the numeric variables
weekly_num <- dplyr::select_if(Weekly, is.numeric)
M = cor(weekly_num)
corrplot(M, method = c("number"))
Volume and Year appear to be strongly correlated; the remaining pairwise correlations among the numeric variables are weak.
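As a quick numerical check (a minimal sketch using only objects already loaded above), the correlation behind that observation can be printed directly:
# Pairwise correlation between Year and Volume, the one strong relationship in the corrplot
cor(Weekly$Year, Weekly$Volume)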
m1 = glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(m1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
vif(m1)
## Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## 1.017298 1.023770 1.019378 1.027388 1.014676 1.024789
Only Lag2 appears to be statistically significant (p = 0.0296). The VIF values are all close to 1, so multicollinearity is not a concern.
# Predicting the responses on m1
predprob_log <- predict.glm(m1, Weekly, type = "response")
predclass_log = ifelse(predprob_log >= 0.5, "Up", "Down")
# Confusion matrix
caret::confusionMatrix(as.factor(predclass_log), Weekly$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 54 48
## Up 430 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.369
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9207
## Specificity : 0.1116
## Pos Pred Value : 0.5643
## Neg Pred Value : 0.5294
## Prevalence : 0.5556
## Detection Rate : 0.5115
## Detection Prevalence : 0.9063
## Balanced Accuracy : 0.5161
##
## 'Positive' Class : Up
##
# Accuracy : 0.5611
# Sensitivity : 0.9207
# Specificity : 0.1116
The model has high sensitivity but low specificity: it correctly identifies about 92% of the Up weeks but only about 11% of the Down weeks. Overall accuracy on the full data is 56.1%, only slightly above the 55.6% obtained by always predicting Up.
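As a cross-check (a sketch reusing the predicted classes computed above; conf_m1 is just a helper name introduced here), the headline statistics can be recomputed by hand from a plain contingency table:
# Recompute accuracy, sensitivity, and specificity from a simple table
conf_m1 <- table(Predicted = predclass_log, Actual = Weekly$Direction)
mean(predclass_log == Weekly$Direction)          # accuracy
conf_m1["Up", "Up"] / sum(conf_m1[, "Up"])       # sensitivity (Up weeks called Up)
conf_m1["Down", "Down"] / sum(conf_m1[, "Down"]) # specificity (Down weeks called Down)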
# Split into training (1990-2008) and test (2009-2010) sets
weekly_train = Weekly %>% filter(Year < 2009)
weekly_test = Weekly %>% filter(Year > 2008)
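A quick sanity check of the split (sketch only): the training set should cover 1990-2008 and the held-out set 2009-2010.
# Confirm the year ranges and sizes of the two sets
range(weekly_train$Year); nrow(weekly_train)
range(weekly_test$Year); nrow(weekly_test)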
m2 = glm(formula = Direction ~ Lag2, data = weekly_train, family = binomial)
summary(m2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = weekly_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
# Predicting the responses on m2
predprob_log2 <- predict.glm(m2, weekly_test, type = "response")
predclass_log2 = ifelse(predprob_log2 >= 0.5, "Up", "Down")
# Confusion matrix
caret::confusionMatrix(as.factor(predclass_log2), weekly_test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.9180
## Specificity : 0.2093
## Pos Pred Value : 0.6222
## Neg Pred Value : 0.6429
## Prevalence : 0.5865
## Detection Rate : 0.5385
## Detection Prevalence : 0.8654
## Balanced Accuracy : 0.5637
##
## 'Positive' Class : Up
##
lda.model = lda(Direction ~ Lag2, data = weekly_train)
lda.model
## Call:
## lda(Direction ~ Lag2, data = weekly_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
predictions.lda = predict(lda.model, weekly_test)
caret::confusionMatrix(as.factor(predictions.lda$class), weekly_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
qda.model = qda(Direction ~ Lag2, data = weekly_train)
qda.model
## Call:
## qda(Direction ~ Lag2, data = weekly_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
predictions.qda = predict(qda.model, weekly_test)
caret::confusionMatrix(as.factor(predictions.qda$class), weekly_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
#convert direction
weekly_train$Direction_dummy <- ifelse(weekly_train$Direction == "Up", 1, 0)
weekly_test$Direction_dummy <- ifelse(weekly_test$Direction == "Up", 1, 0)
#KNN model
set.seed(1)
knn.model <- knn(train = as.matrix(weekly_train$Lag2), test = as.matrix(weekly_test$Lag2), cl = weekly_train$Direction_dummy, k = 1)
predclass_knn <- ifelse(knn.model == 1, "Up", "Down")
confusionMatrix(as.factor(predclass_knn), weekly_test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 30
## Up 22 31
##
## Accuracy : 0.5
## 95% CI : (0.4003, 0.5997)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9700
##
## Kappa : -0.0033
##
## Mcnemar's Test P-Value : 0.3317
##
## Sensitivity : 0.5082
## Specificity : 0.4884
## Pos Pred Value : 0.5849
## Neg Pred Value : 0.4118
## Prevalence : 0.5865
## Detection Rate : 0.2981
## Detection Prevalence : 0.5096
## Balanced Accuracy : 0.4983
##
## 'Positive' Class : Up
##
nb.model = naiveBayes(Direction ~ Lag2, data = weekly_train)
predictions.nb = predict(nb.model, weekly_test)
caret::confusionMatrix(as.factor(predictions.nb), weekly_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
The logistic regression and LDA models provide the best results, each correctly predicting the test-set direction 62.5% of the time (their confusion matrices are identical here). QDA and naive Bayes predict Up for every week, and KNN with K = 1 does no better than chance.
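To make that comparison explicit, the held-out accuracies computed above can be gathered into one small table (a sketch that only reuses the prediction objects already created; acc is a helper name introduced here):
# Test-set (2009-2010) accuracy of each model fitted above
acc <- function(pred) mean(as.character(pred) == as.character(weekly_test$Direction))
data.frame(model = c("Logistic (Lag2)", "LDA", "QDA", "KNN (k = 1)", "Naive Bayes"),
           accuracy = c(acc(predclass_log2), acc(predictions.lda$class),
                        acc(predictions.qda$class), acc(predclass_knn),
                        acc(predictions.nb)))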
#KNN Model 2
set.seed(1)
knn.model2 <- knn(train = as.matrix(weekly_train$Lag2), test = as.matrix(weekly_test$Lag2), cl = weekly_train$Direction_dummy, k = 4)
predclass_knn2 <- ifelse(knn.model2 == 1, "Up", "Down")
confusionMatrix(as.factor(predclass_knn2), weekly_test$Direction, positive = "Up")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 20 17
## Up 23 44
##
## Accuracy : 0.6154
## 95% CI : (0.5149, 0.7091)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.3110
##
## Kappa : 0.1903
##
## Mcnemar's Test P-Value : 0.4292
##
## Sensitivity : 0.7213
## Specificity : 0.4651
## Pos Pred Value : 0.6567
## Neg Pred Value : 0.5405
## Prevalence : 0.5865
## Detection Rate : 0.4231
## Detection Prevalence : 0.6442
## Balanced Accuracy : 0.5932
##
## 'Positive' Class : Up
##
#LDA Model 2
lda.model2 = lda(Direction ~ Lag2^2, data = weekly_train)
predictions.lda2 = predict(lda.model2, weekly_test)
caret::confusionMatrix(as.factor(predictions.lda2$class), weekly_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
qda.model2 = qda(Direction ~ Lag2^2, data = weekly_train)
predictions.qda2 = predict(qda.model2, weekly_test)
caret::confusionMatrix(as.factor(predictions.qda2$class), weekly_test$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
Increasing K from 1 to 4 improved the KNN test accuracy from 50% to 61.5%. I also tried squaring Lag2 for LDA and QDA, which didn't change anything: in an R model formula, Lag2^2 is not interpreted as a squared term (the ^ operator has special formula semantics), so these fits are identical to the Lag2-only models above.
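A sketch of how a genuine quadratic term could be added (the .sq model names are introduced here; results are not shown):
# In an R formula, Lag2^2 simplifies to Lag2, so wrap the square in I() to get a real quadratic term
lda.model.sq = lda(Direction ~ Lag2 + I(Lag2^2), data = weekly_train)
qda.model.sq = qda(Direction ~ Lag2 + I(Lag2^2), data = weekly_train)
mean(predict(lda.model.sq, weekly_test)$class == weekly_test$Direction)
mean(predict(qda.model.sq, weekly_test)$class == weekly_test$Direction)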
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
auto = data.frame(Auto)
auto$mpg01 = ifelse(auto$mpg > median(auto$mpg), 1, 0)
pairs(auto)
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## Min. :0.0
## 1st Qu.:0.0
## Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
##
str(auto)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ mpg01 : num 0 0 0 0 0 0 0 0 0 0 ...
# Correlation
auto_num <- dplyr::select_if(auto, is.numeric)
corrplot(cor(auto_num), method = c("number"))
It looks like cylinders, displacement, and weight are the variables most strongly correlated with mpg01, while horsepower and origin are also correlated, though a little less so.
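The same ranking can be read off numerically (a quick sketch using the auto_num data frame built above):
# Absolute correlation of each numeric variable with mpg01, largest first
sort(abs(cor(auto_num)[, "mpg01"]), decreasing = TRUE)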
set.seed(1)
index = sample(nrow(auto), 0.8*nrow(auto), replace = F) # 80/20 split
auto_train = auto[index,]
auto_test = auto[-index,]
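A quick check of the split (sketch only): row counts of each set and the class balance of the held-out data.
nrow(auto_train); nrow(auto_test)
prop.table(table(auto_test$mpg01))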
auto.lda = lda(mpg01 ~ cylinders + displacement + weight + horsepower + origin, data= auto_train)
predictions.lda.auto = predict(auto.lda, auto_test)
caret::confusionMatrix(as.factor(predictions.lda.auto$class), as.factor(auto_test$mpg01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 0
## 1 7 37
##
## Accuracy : 0.9114
## 95% CI : (0.8259, 0.9636)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 2.819e-13
##
## Kappa : 0.8241
##
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 0.8333
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8409
## Prevalence : 0.5316
## Detection Rate : 0.4430
## Detection Prevalence : 0.4430
## Balanced Accuracy : 0.9167
##
## 'Positive' Class : 0
##
The test error rate for the LDA model is 8.86%.
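That error rate is just one minus the reported accuracy; it can also be computed directly from the predictions (a minimal sketch):
# Fraction of misclassified test cars for the LDA fit
mean(as.character(predictions.lda.auto$class) != as.character(auto_test$mpg01))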
auto.qda = qda(mpg01 ~ cylinders + displacement + weight + horsepower + origin, data= auto_train)
predictions.qda.auto = predict(auto.qda, auto_test)
caret::confusionMatrix(as.factor(predictions.qda.auto$class), as.factor(auto_test$mpg01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 2
## 1 6 35
##
## Accuracy : 0.8987
## 95% CI : (0.8102, 0.9553)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 2.278e-12
##
## Kappa : 0.798
##
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.8571
## Specificity : 0.9459
## Pos Pred Value : 0.9474
## Neg Pred Value : 0.8537
## Prevalence : 0.5316
## Detection Rate : 0.4557
## Detection Prevalence : 0.4810
## Balanced Accuracy : 0.9015
##
## 'Positive' Class : 0
##
The test error rate for the QDA model is 10.13%.
auto.log = glm(formula = mpg01 ~ cylinders + displacement + weight + horsepower + origin, data = auto_train, family = binomial)
summary(auto.log)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + weight + horsepower +
## origin, family = binomial, data = auto_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.4585601 1.9158183 5.981 2.22e-09 ***
## cylinders 0.0253531 0.3773978 0.067 0.9464
## displacement -0.0096120 0.0098606 -0.975 0.3297
## weight -0.0022131 0.0007556 -2.929 0.0034 **
## horsepower -0.0391871 0.0159680 -2.454 0.0141 *
## origin 0.0983464 0.3140504 0.313 0.7542
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 433.83 on 312 degrees of freedom
## Residual deviance: 175.03 on 307 degrees of freedom
## AIC: 187.03
##
## Number of Fisher Scoring iterations: 7
auto.log2 = glm(formula = mpg01 ~ weight + horsepower, data = auto_train, family = binomial)
summary(auto.log2)
##
## Call:
## glm(formula = mpg01 ~ weight + horsepower, family = binomial,
## data = auto_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.3163967 1.6294218 8.172 3.02e-16 ***
## weight -0.0032154 0.0005163 -6.227 4.74e-10 ***
## horsepower -0.0428774 0.0146233 -2.932 0.00337 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 433.83 on 312 degrees of freedom
## Residual deviance: 178.52 on 310 degrees of freedom
## AIC: 184.52
##
## Number of Fisher Scoring iterations: 7
predprob_log_auto <- predict.glm(auto.log2, auto_test, type = "response")
predclass_log_auto = ifelse(predprob_log_auto >= 0.5, 1, 0)
caret::confusionMatrix(as.factor(predclass_log_auto), as.factor(auto_test$mpg01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 4
## 1 6 33
##
## Accuracy : 0.8734
## 95% CI : (0.7795, 0.9376)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 1.017e-10
##
## Kappa : 0.7466
##
## Mcnemar's Test P-Value : 0.7518
##
## Sensitivity : 0.8919
## Specificity : 0.8571
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.9000
## Prevalence : 0.4684
## Detection Rate : 0.4177
## Detection Prevalence : 0.4937
## Balanced Accuracy : 0.8745
##
## 'Positive' Class : 1
##
The test error rate for the logistic regression model (refit using only weight and horsepower, the two significant predictors from the full fit) is 12.66%.
auto.nb = naiveBayes(mpg01 ~ cylinders + displacement + weight + horsepower + origin, data = auto_train)
predictions.nb.auto = predict(auto.nb, auto_test)
caret::confusionMatrix(as.factor(predictions.nb.auto), as.factor(auto_test$mpg01))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 36 1
## 1 6 36
##
## Accuracy : 0.9114
## 95% CI : (0.8259, 0.9636)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 2.819e-13
##
## Kappa : 0.8235
##
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.8571
## Specificity : 0.9730
## Pos Pred Value : 0.9730
## Neg Pred Value : 0.8571
## Prevalence : 0.5316
## Detection Rate : 0.4557
## Detection Prevalence : 0.4684
## Balanced Accuracy : 0.9151
##
## 'Positive' Class : 0
##
The test error rate for the naive Bayes model is 8.86%.
set.seed(1)
knn.model.auto <- knn(train = auto_train[, c("cylinders", "displacement", "weight", "horsepower", "origin")], test = auto_test[, c("cylinders", "displacement", "weight", "horsepower", "origin")], cl = auto_train$mpg01, k = 3)
predclass_knn_auto <- ifelse(knn.model.auto == 1, 1, 0)
confusionMatrix(as.factor(predclass_knn_auto), as.factor(auto_test$mpg01), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 4
## 1 4 33
##
## Accuracy : 0.8987
## 95% CI : (0.8102, 0.9553)
## No Information Rate : 0.5316
## P-Value [Acc > NIR] : 2.278e-12
##
## Kappa : 0.7967
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8919
## Specificity : 0.9048
## Pos Pred Value : 0.8919
## Neg Pred Value : 0.9048
## Prevalence : 0.4684
## Detection Rate : 0.4177
## Detection Prevalence : 0.4684
## Balanced Accuracy : 0.8983
##
## 'Positive' Class : 1
##
# KNN - Accuracy (sensitivity / specificity)
# 1 - 0.8734
# 2 - 0.8608
# 3 - 0.8987 (.8919 / .9048)
# 4 - 0.8987 (.8649 / .9286)
# 5 - 0.8987 (.8919 / .9048)
# 6 - 0.8987 (.9189 / .8810)
# 7 - 0.8987 (.9189 / .8810)
# 8 - 0.8861
The KNN model with K = 3 obtains a 10.13% test error. Values of K from 3 to 7 all perform about equally well on this split (roughly 89.9% accuracy), with K = 3 among the best.
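One way the accuracy-by-K figures above could be reproduced (a sketch; auto_vars is a helper name introduced here, and the same seed and predictor set are assumed):
# Re-run KNN for K = 1 to 8 on the same split and record the test accuracy
auto_vars <- c("cylinders", "displacement", "weight", "horsepower", "origin")
sapply(1:8, function(k) {
  set.seed(1)
  pred <- knn(train = auto_train[, auto_vars], test = auto_test[, auto_vars],
              cl = auto_train$mpg01, k = k)
  mean(as.character(pred) == as.character(auto_test$mpg01))
})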
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
View(Boston)
#create response variable
boston = data.frame(Boston)
boston$crime_rate = ifelse(boston$crim > median(boston$crim), 1, 0)
pairs(boston)
summary(boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv crime_rate
## Min. : 1.73 Min. : 5.00 Min. :0.0
## 1st Qu.: 6.95 1st Qu.:17.02 1st Qu.:0.0
## Median :11.36 Median :21.20 Median :0.5
## Mean :12.65 Mean :22.53 Mean :0.5
## 3rd Qu.:16.95 3rd Qu.:25.00 3rd Qu.:1.0
## Max. :37.97 Max. :50.00 Max. :1.0
str(boston)
## 'data.frame': 506 obs. of 15 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio : num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ crime_rate: num 0 0 0 0 0 0 0 0 0 0 ...
# Correlation
corrplot(cor(boston), method = c("color"))
#boston$chas = as.factor(boston$chas)
#boston$crime_rate = as.factor(boston$crime_rate)
### Splitting into test and train
set.seed(1)
ind = sample(nrow(boston), 0.8*nrow(boston), replace = F)
boston_train = boston[ind,]
boston_test = boston[-ind,]
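A quick check of the split (sketch only): with crime_rate defined at the median, the classes should be roughly balanced in the held-out set.
nrow(boston_train); nrow(boston_test)
prop.table(table(boston_test$crime_rate))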
# Logistic Regression
b1 = glm(formula = crime_rate ~ . -crim, data = boston_train, family = binomial)
summary(b1)
##
## Call:
## glm(formula = crime_rate ~ . - crim, family = binomial, data = boston_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -42.886750 7.627711 -5.622 1.88e-08 ***
## zn -0.106512 0.042954 -2.480 0.013149 *
## indus -0.064955 0.051562 -1.260 0.207759
## chas 0.473349 0.786069 0.602 0.547059
## nox 56.659757 9.246096 6.128 8.90e-10 ***
## rm -0.116177 0.846737 -0.137 0.890868
## age 0.021587 0.013568 1.591 0.111610
## dis 1.068641 0.278562 3.836 0.000125 ***
## rad 0.696528 0.175753 3.963 7.40e-05 ***
## tax -0.007585 0.003075 -2.466 0.013646 *
## ptratio 0.362327 0.148310 2.443 0.014564 *
## black -0.010208 0.005643 -1.809 0.070462 .
## lstat 0.078033 0.055794 1.399 0.161939
## medv 0.165209 0.080364 2.056 0.039806 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.06 on 403 degrees of freedom
## Residual deviance: 159.83 on 390 degrees of freedom
## AIC: 187.83
##
## Number of Fisher Scoring iterations: 9
vif(b1)
## zn indus chas nox rm age dis rad
## 2.361049 2.895022 1.300142 5.296846 6.570467 2.594352 5.457461 2.128960
## tax ptratio black lstat medv
## 1.863511 2.288799 1.063539 2.594020 9.685458
# log model #1
predprob_log_boston <- predict.glm(b1, boston_test, type = "response")
predclass_log_boston = ifelse(predprob_log_boston >= 0.5, 1, 0)
caret::confusionMatrix(as.factor(predclass_log_boston), as.factor(boston_test$crime_rate), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43 5
## 1 8 46
##
## Accuracy : 0.8725
## 95% CI : (0.7919, 0.9304)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 2.151e-15
##
## Kappa : 0.7451
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.9020
## Specificity : 0.8431
## Pos Pred Value : 0.8519
## Neg Pred Value : 0.8958
## Prevalence : 0.5000
## Detection Rate : 0.4510
## Detection Prevalence : 0.5294
## Balanced Accuracy : 0.8725
##
## 'Positive' Class : 1
##
# Stepwise Selection with AIC
null_model = glm(crime_rate ~ 1, data = boston_train, family = binomial)
full_model = b1
step.model.AIC = step(null_model, scope = list(upper = full_model),
direction = "both", test = "Chisq", trace = F)
summary(step.model.AIC)
##
## Call:
## glm(formula = crime_rate ~ nox + rad + tax + ptratio + dis +
## zn + medv + age + black, family = binomial, data = boston_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.519412 7.216180 -5.477 4.34e-08 ***
## nox 51.313423 8.113413 6.325 2.54e-10 ***
## rad 0.771994 0.163016 4.736 2.18e-06 ***
## tax -0.008798 0.002901 -3.033 0.002422 **
## ptratio 0.353281 0.135071 2.616 0.008909 **
## dis 1.009916 0.270475 3.734 0.000189 ***
## zn -0.105917 0.039137 -2.706 0.006803 **
## medv 0.128740 0.040097 3.211 0.001324 **
## age 0.026176 0.011561 2.264 0.023557 *
## black -0.010232 0.005731 -1.786 0.074176 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.06 on 403 degrees of freedom
## Residual deviance: 163.54 on 394 degrees of freedom
## AIC: 183.54
##
## Number of Fisher Scoring iterations: 9
# Best model based on stepwise
b2 <- glm(crime_rate ~ nox + rad + tax + ptratio + dis + zn + medv + age + black, boston_train, family = binomial)
# log model #2
predprob_log_boston2 <- predict.glm(b2, boston_test, type = "response")
predclass_log_boston2 = ifelse(predprob_log_boston2 >= 0.5, 1, 0)
caret::confusionMatrix(as.factor(predclass_log_boston2), as.factor(boston_test$crime_rate), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 45 6
## 1 6 45
##
## Accuracy : 0.8824
## 95% CI : (0.8035, 0.9377)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 3.063e-16
##
## Kappa : 0.7647
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8824
## Specificity : 0.8824
## Pos Pred Value : 0.8824
## Neg Pred Value : 0.8824
## Prevalence : 0.5000
## Detection Rate : 0.4412
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8824
##
## 'Positive' Class : 1
##
# LDA
boston.lda = lda(crime_rate ~ nox + rad + tax + ptratio + dis + zn + medv + age + black, data= boston_train)
predictions.lda.boston = predict(boston.lda, boston_test)
caret::confusionMatrix(as.factor(predictions.lda.boston$class), as.factor(boston_test$crime_rate))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 50 13
## 1 1 38
##
## Accuracy : 0.8627
## 95% CI : (0.7804, 0.9229)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 1.388e-14
##
## Kappa : 0.7255
##
## Mcnemar's Test P-Value : 0.003283
##
## Sensitivity : 0.9804
## Specificity : 0.7451
## Pos Pred Value : 0.7937
## Neg Pred Value : 0.9744
## Prevalence : 0.5000
## Detection Rate : 0.4902
## Detection Prevalence : 0.6176
## Balanced Accuracy : 0.8627
##
## 'Positive' Class : 0
##
# Naive Bayes
boston.nb = naiveBayes(crime_rate ~ nox + rad + tax + ptratio + dis + zn + medv + age + black, data = boston_train)
predictions.nb.boston = predict(boston.nb, boston_test)
caret::confusionMatrix(as.factor(predictions.nb.boston), as.factor(boston_test$crime_rate))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 47 15
## 1 4 36
##
## Accuracy : 0.8137
## 95% CI : (0.7245, 0.884)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 5.079e-11
##
## Kappa : 0.6275
##
## Mcnemar's Test P-Value : 0.02178
##
## Sensitivity : 0.9216
## Specificity : 0.7059
## Pos Pred Value : 0.7581
## Neg Pred Value : 0.9000
## Prevalence : 0.5000
## Detection Rate : 0.4608
## Detection Prevalence : 0.6078
## Balanced Accuracy : 0.8137
##
## 'Positive' Class : 0
##
# KNN
set.seed(1)
knn.model.boston <- knn(train = boston_train[, c("nox", "rad", "tax", "ptratio", "dis", "zn","medv","age","black")], test = boston_test[, c("nox", "rad", "tax", "ptratio", "dis", "zn","medv","age","black")], cl = boston_train$crime_rate, k = 2)
predclass_knn_boston <- ifelse(knn.model.boston == 1, 1, 0)
confusionMatrix(as.factor(predclass_knn_boston), as.factor(boston_test$crime_rate), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 45 3
## 1 6 48
##
## Accuracy : 0.9118
## 95% CI : (0.8391, 0.9589)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8235
##
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.9412
## Specificity : 0.8824
## Pos Pred Value : 0.8889
## Neg Pred Value : 0.9375
## Prevalence : 0.5000
## Detection Rate : 0.4706
## Detection Prevalence : 0.5294
## Balanced Accuracy : 0.9118
##
## 'Positive' Class : 1
##
# KNN - Accuracy
# 1 - 0.902
# 2 - 0.9118
# 3 - 0.902
# 4 - 0.902
# 5 - 0.8922
# 6 - 0.8824
# 8 - 0.8725
The model with the highest test accuracy was the KNN model with K = 2, at 91.18%. According to stepwise selection (AIC), the best logistic regression model includes nox, rad, tax, ptratio, dis, zn, medv, age, and black; the same predictor subset was used for the LDA, naive Bayes, and KNN fits.
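A sketch that gathers those held-out accuracies side by side (it only reuses the prediction objects created above; acc_b is a helper name introduced here):
# 20% hold-out accuracy of each Boston model fitted above
acc_b <- function(pred) mean(as.character(pred) == as.character(boston_test$crime_rate))
data.frame(model = c("Logistic (all predictors)", "Logistic (stepwise)", "LDA",
                     "Naive Bayes", "KNN (k = 2)"),
           accuracy = c(acc_b(predclass_log_boston), acc_b(predclass_log_boston2),
                        acc_b(predictions.lda.boston$class), acc_b(predictions.nb.boston),
                        acc_b(predclass_knn_boston)))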