10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.3
attach(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
pairs(Weekly)

Based on the correlation graphs, there appears to be a positive relationship between year and volume (volume of shares traded). From 1990 to 2010, the volume of stock shares increased as the years progressed.
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
probability <- predict(fit.direction, type = "response")
predict.direction <- rep("Down", length(probability))
predict.direction[probability > 0.5] <- "Up"
table(predict.direction, Direction)
## Direction
## predict.direction Down Up
## Down 54 48
## Up 430 557
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010)
train <- Weekly[Weekly$Year <= 2008,]
test <- Weekly[Weekly$Year > 2008,]
lag2.glm <- glm(Direction ~ Lag2, family = 'binomial', data = train)
summary(lag2.glm)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
probability2 <- predict(lag2.glm, newdata = test, type = 'response')
predict.glm2 <- ifelse(probability2 >= 0.5, 'Up', 'Down')
table(predict.glm2, test$Direction)
##
## predict.glm2 Down Up
## Down 9 5
## Up 34 56
mean(predict.glm2 != test$Direction)
## [1] 0.375
(e) Repeat (d) using LDA.
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.3
lda.model <- lda(Direction ~ Lag2, data = train)
probability3 <- predict(lda.model, newdata = test)
predict.lda <- probability3$class
table(predict.lda, test$Direction)
##
## predict.lda Down Up
## Down 9 5
## Up 34 56
mean(predict.lda != test$Direction)
## [1] 0.375
(f) Repeat (d) using QDA.
qda.model <- qda(Direction ~ Lag2, data = train)
probability4 <- predict(qda.model, newdata = test)
predict.qda <- probability4$class
table(predict.qda, test$Direction)
##
## predict.qda Down Up
## Down 0 0
## Up 43 61
mean(predict.qda != test$Direction)
## [1] 0.4134615
(g) Repeat (d) using KNN with K = 1.
library(class)
knn.model <- knn(train = data.frame(train$Lag2),
test = data.frame(test$Lag2),
cl = train$Direction, k = 1)
table(knn.model, test$Direction)
##
## knn.model Down Up
## Down 21 30
## Up 22 31
mean(knn.model != test$Direction)
## [1] 0.5
(h) Which of these methods appears to provide the best results on this data?
The Logistic Regression and LDA model gave the best results with the lowest error rates of 37.5%.
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
lag2.glm <- glm(Direction ~ Lag2:Lag1, family = 'binomial', data = train)
probability2 <- predict(lag2.glm, newdata = test, type = 'response')
predict.glm2 <- ifelse(probability2 >= 0.5, 'Up', 'Down')
table(predict.glm2, test$Direction)
##
## predict.glm2 Down Up
## Down 1 1
## Up 42 60
mean(predict.glm2 != test$Direction)
## [1] 0.4134615
lda.model <- lda(Direction ~ Lag2:Lag1, data = train)
probability3 <- predict(lda.model, newdata = test)
predict.lda <- probability3$class
table(predict.lda, test$Direction)
##
## predict.lda Down Up
## Down 0 1
## Up 43 60
mean(predict.lda != test$Direction)
## [1] 0.4230769
qda.model <- qda(Direction ~ Lag2:Lag1, data = train)
probability4 <- predict(qda.model, newdata = test)
predict.qda <- probability4$class
table(predict.qda, test$Direction)
##
## predict.qda Down Up
## Down 16 32
## Up 27 29
mean(predict.qda != test$Direction)
## [1] 0.5673077
knn.model <- knn(train = data.frame(train$Lag2),
test = data.frame(test$Lag2),
cl = train$Direction, k = 10)
table(knn.model, test$Direction)
##
## knn.model Down Up
## Down 19 19
## Up 24 42
mean(knn.model != test$Direction)
## [1] 0.4134615
11. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
(c) Split the data into a training set and a test set.
train <- (year %% 2 == 0) #split data into training set
Auto.train <- Auto[train, ]
Auto.test <- Auto[!train, ]
mpg01.test <- mpg01[!train]
13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
Set-Up
library(MASS)
attach(Boston)
crim01 <- rep(0, length(crim)) #make crim binary 0 or 1
crim01[crim > median(crim)] <- 1 #split into 0 or 1 by median
Boston <- data.frame(Boston, crim01) #incorportate crim01 (binary) into data set
train <- 1:(length(crim) / 2) #first half of data into training set
test <- (length(crim) / 2 + 1):length(crim)
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
crim01.test <- crim01[test]
GLM
glm.fit <- glm(crim01 ~ . - crim01 - crim, family = binomial, data = Boston, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.prob <- predict(glm.fit, Boston.test, type = "response")
glm.predict <- rep(0, length(glm.prob))
glm.predict[glm.prob > 0.5] <- 1
table(glm.predict, crim01.test)
## crim01.test
## glm.predict 0 1
## 0 68 24
## 1 22 139
mean(glm.predict != crim01.test) #error rate for glm
## [1] 0.1818182
glm.fit <- glm(crim01 ~ . - crim01 - crim - zn - nox - dis, family = binomial, data = Boston, subset = train)
glm.prob <- predict(glm.fit, Boston.test, type = "response")
glm.predict <- rep(0, length(glm.prob))
glm.predict[glm.prob > 0.5] <- 1
table(glm.predict, crim01.test)
## crim01.test
## glm.predict 0 1
## 0 85 30
## 1 5 133
mean(glm.predict != crim01.test) #error rate for glm -zn -nox -dis
## [1] 0.1383399
LDA
lda.fit <- lda(crim01 ~ . - crim01 - crim, data = Boston, subset = train)
lda.predict <- predict(lda.fit, Boston.test)
table(lda.predict$class, crim01.test)
## crim01.test
## 0 1
## 0 80 24
## 1 10 139
mean(lda.predict$class != crim01.test) #error rate for lda
## [1] 0.1343874
lda.fit <- lda(crim01 ~ . - crim01 - crim - zn - nox - dis, data = Boston, subset = train)
lda.predict <- predict(lda.fit, Boston.test)
table(lda.predict$class, crim01.test)
## crim01.test
## 0 1
## 0 85 30
## 1 5 133
mean(lda.predict$class != crim01.test) #error rate for lda -zn -nox -dis
## [1] 0.1383399
KNN
train.X <- cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[train, ]
test.X <- cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[test, ]
train.crim01 <- crim01[train]
set.seed(1)
knn.predict <- knn(train.X, test.X, train.crim01, k = 1)
table(knn.predict, crim01.test)
## crim01.test
## knn.predict 0 1
## 0 85 111
## 1 5 52
mean(knn.predict != crim01.test) #error rate when k=1
## [1] 0.458498
knn.predict <- knn(train.X, test.X, train.crim01, k = 10)
table(knn.predict, crim01.test)
## crim01.test
## knn.predict 0 1
## 0 83 23
## 1 7 140
mean(knn.predict != crim01.test) #error rate when k=10
## [1] 0.1185771
Our resluts show that the KNN with k=10 gives the lowest error rate of 11.86%. We were also able to lower the error rates of the GLM by removing the variables zn (proportion of residential land zoned for lots over 25,000 sq.ft), nox (nitrogen oxides concentration) and ds (weighted mean of distances to five Boston employment centres) with an error rate of 13.83% rather than 18.18%. Removing these variables did not have much of a difference in the error rates of the LDA models.