Chapter 04 (page 168): 10, 11, 13
This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
library(ISLR)
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
attach(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
pairs(Weekly)
cor(Weekly[-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
Based on the scatterplot matrix, we can see that Volume and Year have a positive relationship: trading volume has increased with each additional year. Today and Direction are also clearly related, but this is by construction, since Direction simply records whether the return in Today is positive or negative.
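As a quick check of the Volume trend seen in the scatterplot matrix, plotting Volume against Year makes the growth over time explicit (a minimal sketch using the attached Weekly columns):
# weekly trading volume over time; the upward drift matches the Year/Volume correlation of about 0.84 shown above
plot(Year, Volume, main = 'Weekly Trading Volume by Year')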
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
glm.fit.weekly.full = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data = Weekly, family = binomial)
glm.fit.weekly.full
##
## Call: glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag3 Lag4
## 0.26686 -0.04127 0.05844 -0.01606 -0.02779
## Lag5 Volume
## -0.01447 -0.02274
##
## Degrees of Freedom: 1088 Total (i.e. Null); 1082 Residual
## Null Deviance: 1496
## Residual Deviance: 1486 AIC: 1500
summary(glm.fit.weekly.full)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Only Lag2 appears to be statistically significant, with a p-value of 0.0296.
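The p-values can also be pulled out of the fitted model programmatically rather than read off the printed summary (a small sketch):
# extract the coefficient p-values from the fitted logistic regression
coef(summary(glm.fit.weekly.full))[, 'Pr(>|z|)']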
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm.probs.weekly.full=predict(glm.fit.weekly.full,type='response')
glm.pred.weekly.full=rep("Down",1089)
glm.pred.weekly.full[glm.probs.weekly.full>0.5] = "Up"
table(glm.pred.weekly.full,Direction)
## Direction
## glm.pred.weekly.full Down Up
## Down 54 48
## Up 430 557
The table above shows that the model does well in weeks when the market goes up, correctly identifying 557 of the 605 "Up" weeks. However, it does very poorly in weeks when the market goes down, correctly identifying only 54 of the 484 "Down" weeks.
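The overall accuracy and the class-specific rates quoted above can be computed directly from the confusion matrix counts (a quick arithmetic check):
(54 + 557) / 1089   # overall fraction correct, about 0.561
557 / (48 + 557)    # correct rate in 'Up' weeks, about 0.921
54 / (54 + 430)     # correct rate in 'Down' weeks, about 0.112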
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train=(Year<2009)
test.weekly = Weekly[!train,]
Direction.test.weekly=Direction[!train]
glm.fit.weekly=glm(Direction~Lag2, data=Weekly, subset=train, family = binomial)
glm.probs.weekly=predict(glm.fit.weekly,test.weekly, type ='response')
length(glm.probs.weekly)
## [1] 104
glm.pred.weekly=rep('Down',104)
glm.pred.weekly[glm.probs.weekly>0.5] = 'Up'
table(glm.pred.weekly,Direction.test.weekly)
## Direction.test.weekly
## glm.pred.weekly Down Up
## Down 9 5
## Up 34 56
mean(glm.pred.weekly==Direction.test.weekly)
## [1] 0.625
(e) Repeat (d) using LDA.
library(MASS)
lda.fit.weekly=lda(Direction~Lag2, data=Weekly, subset=train)
lda.fit.weekly
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
plot(lda.fit.weekly)
lda.pred.weekly=predict(lda.fit.weekly, test.weekly)
names(lda.pred.weekly)
## [1] "class" "posterior" "x"
lda.class.weekly=lda.pred.weekly$class
table(lda.class.weekly,Direction.test.weekly)
## Direction.test.weekly
## lda.class.weekly Down Up
## Down 9 5
## Up 34 56
mean(lda.class.weekly==Direction.test.weekly)
## [1] 0.625
(f) Repeat (d) using QDA.
qda.fit.weekly=qda(Direction~Lag2, data=Weekly, subset=train)
qda.fit.weekly
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.class.weekly=predict(qda.fit.weekly,test.weekly)$class
table(qda.class.weekly,Direction.test.weekly)
## Direction.test.weekly
## qda.class.weekly Down Up
## Down 0 0
## Up 43 61
mean(qda.class.weekly==Direction.test.weekly)
## [1] 0.5865385
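Because QDA predicts 'Up' for every held-out week, its accuracy is simply the proportion of 'Up' weeks in the 2009-2010 data, as a quick check confirms:
mean(Direction.test.weekly == 'Up')   # 61/104, about 0.587, matching the QDA accuracy above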
(g) Repeat (d) using KNN with K=1.
library(class)
train.x=as.matrix(Lag2[train])
test.x=as.matrix(Lag2[!train])
train.Direction=Direction[train]
dim(train.x)
## [1] 985 1
dim(test.x)
## [1] 104 1
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.x,test.x,train.Direction,k=1)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred==Direction.test.weekly)
## [1] 0.5
(h) Which of these methods appears to provide the best results on this data?
Logistic regression and LDA appear to provide the best results on the held-out data, each correctly predicting 62.5% of the 2009-2010 weeks; QDA (about 58.7%) and KNN with K = 1 (50%) do worse.
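A compact way to compare the four classifiers is to compute their held-out accuracies in one call (a sketch reusing the prediction objects created in parts (d) through (g)):
# test accuracy of each method on the 2009-2010 weeks
sapply(list(logistic = glm.pred.weekly,
            lda = as.character(lda.class.weekly),
            qda = as.character(qda.class.weekly),
            knn1 = as.character(knn.pred)),
       function(pred) mean(pred == Direction.test.weekly))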
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
glm.fit.weekly.1 = glm(Direction~Lag2+Year*Volume, data = Weekly, family = binomial)
glm.fit.weekly.1
##
## Call: glm(formula = Direction ~ Lag2 + Year * Volume, family = binomial,
## data = Weekly)
##
## Coefficients:
## (Intercept) Lag2 Year Volume Year:Volume
## 9.476345 0.059935 -0.004581 -29.828896 0.014842
##
## Degrees of Freedom: 1088 Total (i.e. Null); 1084 Residual
## Null Deviance: 1496
## Residual Deviance: 1489 AIC: 1499
summary(glm.fit.weekly.1)
##
## Call:
## glm(formula = Direction ~ Lag2 + Year * Volume, family = binomial,
## data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.517 -1.262 1.004 1.082 1.422
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.476345 42.550560 0.223 0.8238
## Lag2 0.059935 0.026909 2.227 0.0259 *
## Year -0.004581 0.021359 -0.214 0.8302
## Volume -29.828896 43.184789 -0.691 0.4897
## Year:Volume 0.014842 0.021473 0.691 0.4894
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1489.4 on 1084 degrees of freedom
## AIC: 1499.4
##
## Number of Fisher Scoring iterations: 4
glm.probs.weekly.1=predict(glm.fit.weekly.1,test.weekly, type ='response')
glm.pred.weekly.1=rep('Down',104)
glm.pred.weekly.1[glm.probs.weekly.1>0.5] = 'Up'
table(glm.pred.weekly.1,Direction.test.weekly)
## Direction.test.weekly
## glm.pred.weekly.1 Down Up
## Down 6 4
## Up 37 57
mean(glm.pred.weekly.1==Direction.test.weekly)
## [1] 0.6057692
glm.fit.weekly.2 = glm(Direction~Lag1*Lag2, data = Weekly, family = binomial)
glm.fit.weekly.2
##
## Call: glm(formula = Direction ~ Lag1 * Lag2, family = binomial, data = Weekly)
##
## Coefficients:
## (Intercept) Lag1 Lag2 Lag1:Lag2
## 0.221393 -0.035783 0.059931 0.002018
##
## Degrees of Freedom: 1088 Total (i.e. Null); 1085 Residual
## Null Deviance: 1496
## Residual Deviance: 1488 AIC: 1496
summary(glm.fit.weekly.2)
##
## Call:
## glm(formula = Direction ~ Lag1 * Lag2, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.579 -1.263 1.004 1.081 1.572
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.221393 0.061474 3.601 0.000317 ***
## Lag1 -0.035783 0.027992 -1.278 0.201139
## Lag2 0.059931 0.026649 2.249 0.024521 *
## Lag1:Lag2 0.002018 0.006729 0.300 0.764207
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1488.1 on 1085 degrees of freedom
## AIC: 1496.1
##
## Number of Fisher Scoring iterations: 4
glm.probs.weekly.2=predict(glm.fit.weekly.2,test.weekly, type ='response')
glm.pred.weekly.2=rep('Down',104)
glm.pred.weekly.2[glm.probs.weekly.2>0.5] = 'Up'
table(glm.pred.weekly.2,Direction.test.weekly)
## Direction.test.weekly
## glm.pred.weekly.2 Down Up
## Down 5 6
## Up 38 55
mean(glm.pred.weekly.2==Direction.test.weekly)
## [1] 0.5769231
glm.fit.weekly.3 = glm(Direction~Lag2*Volume, data = Weekly, family = binomial)
glm.fit.weekly.3
##
## Call: glm(formula = Direction ~ Lag2 * Volume, family = binomial, data = Weekly)
##
## Coefficients:
## (Intercept) Lag2 Volume Lag2:Volume
## 0.23724 0.05061 -0.01297 0.00415
##
## Degrees of Freedom: 1088 Total (i.e. Null); 1085 Residual
## Null Deviance: 1496
## Residual Deviance: 1490 AIC: 1498
summary(glm.fit.weekly.3)
##
## Call:
## glm(formula = Direction ~ Lag2 * Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.589 -1.269 1.015 1.084 1.474
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.23724 0.08418 2.818 0.00483 **
## Lag2 0.05061 0.03911 1.294 0.19560
## Volume -0.01297 0.03668 -0.354 0.72363
## Lag2:Volume 0.00415 0.01068 0.389 0.69764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1490.1 on 1085 degrees of freedom
## AIC: 1498.1
##
## Number of Fisher Scoring iterations: 4
glm.probs.weekly.3=predict(glm.fit.weekly.3,test.weekly, type ='response')
glm.pred.weekly.3=rep('Down',104)
glm.pred.weekly.3[glm.probs.weekly.3>0.5] = 'Up'
table(glm.pred.weekly.3,Direction.test.weekly)
## Direction.test.weekly
## glm.pred.weekly.3 Down Up
## Down 10 6
## Up 33 55
mean(glm.pred.weekly.3==Direction.test.weekly)
## [1] 0.625
library(MASS)
lda.fit.weekly.1=lda(Direction~Lag2+Year*Volume, data=Weekly, subset=train)
plot(lda.fit.weekly.1)
lda.pred.weekly.1=predict(lda.fit.weekly.1, test.weekly)
names(lda.pred.weekly.1)
## [1] "class" "posterior" "x"
lda.class.weekly.1=lda.pred.weekly.1$class
table(lda.class.weekly.1,Direction.test.weekly)
## Direction.test.weekly
## lda.class.weekly.1 Down Up
## Down 30 42
## Up 13 19
mean(lda.class.weekly.1==Direction.test.weekly)
## [1] 0.4711538
lda.fit.weekly.2=lda(Direction~Lag1*Lag2, data=Weekly, subset=train)
plot(lda.fit.weekly.2)
lda.pred.weekly.2=predict(lda.fit.weekly.2, test.weekly)
names(lda.pred.weekly.2)
## [1] "class" "posterior" "x"
lda.class.weekly.2=lda.pred.weekly.2$class
table(lda.class.weekly.2,Direction.test.weekly)
## Direction.test.weekly
## lda.class.weekly.2 Down Up
## Down 7 8
## Up 36 53
mean(lda.class.weekly.2==Direction.test.weekly)
## [1] 0.5769231
lda.fit.weekly.3=lda(Direction~Lag2*Volume, data=Weekly, subset=train)
plot(lda.fit.weekly.3)
lda.pred.weekly.3=predict(lda.fit.weekly.3, test.weekly)
names(lda.pred.weekly.3)
## [1] "class" "posterior" "x"
lda.class.weekly.3=lda.pred.weekly.3$class
table(lda.class.weekly.3,Direction.test.weekly)
## Direction.test.weekly
## lda.class.weekly.3 Down Up
## Down 20 25
## Up 23 36
mean(lda.class.weekly.3==Direction.test.weekly)
## [1] 0.5384615
qda.fit.weekly.1=qda(Direction~Lag2+Year*Volume, data=Weekly, subset=train)
qda.class.weekly.1=predict(qda.fit.weekly.1,test.weekly)$class
table(qda.class.weekly.1,Direction.test.weekly)
## Direction.test.weekly
## qda.class.weekly.1 Down Up
## Down 28 34
## Up 15 27
mean(qda.class.weekly.1==Direction.test.weekly)
## [1] 0.5288462
qda.fit.weekly.2=qda(Direction~Lag1*Lag2, data=Weekly, subset=train)
qda.class.weekly.2=predict(qda.fit.weekly.2,test.weekly)$class
table(qda.class.weekly.2,Direction.test.weekly)
## Direction.test.weekly
## qda.class.weekly.2 Down Up
## Down 23 36
## Up 20 25
mean(qda.class.weekly.2==Direction.test.weekly)
## [1] 0.4615385
qda.fit.weekly.3=qda(Direction~Lag2*Volume, data=Weekly, subset=train)
qda.class.weekly.3=predict(qda.fit.weekly.3,test.weekly)$class
table(qda.class.weekly.3,Direction.test.weekly)
## Direction.test.weekly
## qda.class.weekly.3 Down Up
## Down 37 49
## Up 6 12
mean(qda.class.weekly.3==Direction.test.weekly)
## [1] 0.4711538
library(class)
train.x=cbind(Lag2,Year,Volume)[train,]
test.x=cbind(Lag2,Year,Volume)[!train,]
train.Direction=Direction[train]
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.x,test.x,train.Direction,k=1)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 29 32
## Up 14 29
mean(knn.pred==Direction.test.weekly)
## [1] 0.5576923
knn.pred=knn(train.x,test.x,train.Direction,k=3)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 33 37
## Up 10 24
mean(knn.pred==Direction.test.weekly)
## [1] 0.5480769
knn.pred=knn(train.x,test.x,train.Direction,k=5)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 29 36
## Up 14 25
mean(knn.pred==Direction.test.weekly)
## [1] 0.5192308
train.x=cbind(Lag2,Lag1)[train,]
test.x=cbind(Lag2,Lag1)[!train,]
train.Direction=Direction[train]
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.x,test.x,train.Direction,k=1)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 18 29
## Up 25 32
mean(knn.pred==Direction.test.weekly)
## [1] 0.4807692
knn.pred=knn(train.x,test.x,train.Direction,k=3)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 22 29
## Up 21 32
mean(knn.pred==Direction.test.weekly)
## [1] 0.5192308
knn.pred=knn(train.x,test.x,train.Direction,k=5)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 22 32
## Up 21 29
mean(knn.pred==Direction.test.weekly)
## [1] 0.4903846
train.x=cbind(Lag2,Volume)[train,]
test.x=cbind(Lag2,Volume)[!train,]
train.Direction=Direction[train]
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.x,test.x,train.Direction,k=1)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 26 29
## Up 17 32
mean(knn.pred==Direction.test.weekly)
## [1] 0.5576923
knn.pred=knn(train.x,test.x,train.Direction,k=3)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 30 34
## Up 13 27
mean(knn.pred==Direction.test.weekly)
## [1] 0.5480769
knn.pred=knn(train.x,test.x,train.Direction,k=5)
table(knn.pred,Direction.test.weekly)
## Direction.test.weekly
## knn.pred Down Up
## Down 28 34
## Up 15 27
mean(knn.pred==Direction.test.weekly)
## [1] 0.5288462
Of all the additional models fit above, the logistic regression model for Direction on Lag2, Volume, and the Lag2:Volume interaction provided the highest fraction of correct predictions on the held-out data, 62.5%, which ties the Lag2-only logistic regression and LDA models from parts (d) and (e). Note, however, that the additional logistic regression models in this part were fit on the full Weekly data set rather than on the 1990-2008 training period, so their held-out accuracies are somewhat optimistic.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
attach(Auto)
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
mpg01 = rep(0,length(mpg))
mpg01[mpg > median(mpg)] = 1
auto = data.frame(Auto,mpg01)
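A quick check that the new variable splits the cars roughly in half (ties at the median can make the split slightly uneven):
table(mpg01)   # counts of below- and above-median mpg cars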
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
cor(auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
pairs(auto)
boxplot(cylinders~mpg01, main = 'Cylinders by mpg01')
boxplot(displacement~mpg01, main = 'Displacement by mpg01')
boxplot(horsepower~mpg01, main = 'Horsepower by mpg01')
boxplot(weight~mpg01, main = 'Weight by mpg01')
boxplot(acceleration~mpg01, main = 'Acceleration by mpg01')
boxplot(year~mpg01, main = 'Year by mpg01')
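A scatterplot of two of the strongest predictors, colored by mpg01, gives a complementary view of how well the two classes separate (a small sketch; red points are cars with above-median mpg):
# weight vs. horsepower, colored by mpg01 class
plot(weight, horsepower, col = ifelse(mpg01 == 1, 'red', 'black'),
     main = 'Weight vs. Horsepower by mpg01')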
Looking at the correlation matrix with mpg01 included, we can see that mpg01 is strongly correlated with several of the Auto variables: it has a strong negative correlation with the number of cylinders, engine displacement, horsepower, and vehicle weight, and a moderate positive correlation with acceleration and year (the newer the car, the more likely it is to have gas mileage above the median for the data set).
Since mpg01 is a binary variable, the scatterplots will not show a trend in mpg01 itself. However, for horsepower, acceleration, and weight there is an obvious difference between the mpg01 = 1 group (above-median mpg) and the mpg01 = 0 group (below-median mpg).
The boxplot of cylinders by mpg01 shows that cars with more cylinders generally get worse gas mileage; most vehicles with above-median mpg have 4 cylinders. Similarly, cars with above-median mpg have smaller engines (displacement), less powerful engines (horsepower), and lower weight than cars with below-median mpg. Since acceleration measures the time needed to reach 60 mph, its positive correlation with mpg01 means that higher-mpg cars actually accelerate more slowly, consistent with their smaller, less powerful engines. Finally, as noted above, newer cars are more likely to have above-median mpg.
(c) Split the data into a training set and a test set.
train.auto=(year<79)
test.auto = auto[!train.auto,]
mpg01.test = mpg01[!train.auto]
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
library(MASS)
lda.fit.auto = lda(mpg01~cylinders+weight+displacement+horsepower+year, data = auto, subset = train.auto)
plot(lda.fit.auto)
lda.pred.auto = predict(lda.fit.auto, test.auto)
lda.class.auto = lda.pred.auto$class
table(lda.class.auto,mpg01.test)
## mpg01.test
## lda.class.auto 0 1
## 0 17 15
## 1 1 81
mean(lda.class.auto != mpg01.test)
## [1] 0.1403509
Fitting an LDA model for mpg01 on cylinders, weight, displacement, horsepower, and year resulted in a model with an error percentage of only ~14%.
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit.auto=qda(mpg01~cylinders+weight+displacement+horsepower+year, data=auto, subset=train.auto)
qda.class.auto=predict(qda.fit.auto,test.auto)$class
table(qda.class.auto,mpg01.test)
## mpg01.test
## qda.class.auto 0 1
## 0 17 19
## 1 1 77
mean(qda.class.auto != mpg01.test)
## [1] 0.1754386
Fitting a QDA for mpg01 with cylinders, weight, displacement, horsepower, and year results in a test error of ~17.5%.
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
glm.fit.auto = glm(mpg01~cylinders+weight+displacement+horsepower+year, data=auto, subset=train.auto, family = binomial)
summary(glm.fit.auto)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower +
## year, family = binomial, data = auto, subset = train.auto)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4289 -0.1162 -0.0028 0.1674 2.2970
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.615923 7.761226 -0.079 0.9367
## cylinders -0.465777 0.615851 -0.756 0.4495
## weight -0.003961 0.001325 -2.989 0.0028 **
## displacement -0.005918 0.014769 -0.401 0.6886
## horsepower -0.043138 0.023610 -1.827 0.0677 .
## year 0.241357 0.111657 2.162 0.0306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 363.21 on 277 degrees of freedom
## Residual deviance: 101.47 on 272 degrees of freedom
## AIC: 113.47
##
## Number of Fisher Scoring iterations: 8
glm.probs.auto = predict(glm.fit.auto,test.auto, type ='response')
glm.pred.auto = rep(0,114)
glm.pred.auto[glm.probs.auto > 0.5] = 1
table(glm.pred.auto,mpg01.test)
## mpg01.test
## glm.pred.auto 0 1
## 0 17 14
## 1 1 82
mean(glm.pred.auto != mpg01.test)
## [1] 0.1315789
Fitting a logistic regression model on cylinders, weight, displacement, horsepower, and year resulted in a test error of about 13.2%. However, looking at the p-values of the coefficients, only weight and year are clearly significant (horsepower is marginal at p ≈ 0.068). The model is refit below with only those two variables.
glm.fit.auto = glm(mpg01~weight+year, data=auto, subset=train.auto, family = binomial)
summary(glm.fit.auto)
##
## Call:
## glm(formula = mpg01 ~ weight + year, family = binomial, data = auto,
## subset = train.auto)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.37472 -0.14004 -0.00743 0.20230 2.12167
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.2826234 7.2101288 -0.871 0.38356
## weight -0.0062790 0.0008974 -6.997 2.62e-12 ***
## year 0.3065403 0.1086480 2.821 0.00478 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 363.21 on 277 degrees of freedom
## Residual deviance: 107.78 on 275 degrees of freedom
## AIC: 113.78
##
## Number of Fisher Scoring iterations: 8
glm.probs.auto = predict(glm.fit.auto,test.auto, type ='response')
glm.pred.auto = rep(0,114)
glm.pred.auto[glm.probs.auto > 0.5] = 1
table(glm.pred.auto,mpg01.test)
## mpg01.test
## glm.pred.auto 0 1
## 0 17 12
## 1 1 84
mean(glm.pred.auto != mpg01.test)
## [1] 0.1140351
After refitting the model on only weight and year, the test error drops to about 11.4%, and both predictors are highly significant.
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
library(class)
train.x=cbind(weight,year)[train.auto,]
test.x=cbind(weight,year)[!train.auto,]
train.mpg01=mpg01[train.auto]
dim(train.x)
## [1] 278 2
dim(test.x)
## [1] 114 2
length(train.mpg01)
## [1] 278
set.seed(1)
knn.pred.auto.1=knn(train.x,test.x,train.mpg01,k=1)
table(knn.pred.auto.1,mpg01.test)
## mpg01.test
## knn.pred.auto.1 0 1
## 0 18 27
## 1 0 69
mean(knn.pred.auto.1 != mpg01.test)
## [1] 0.2368421
knn.pred.auto.3=knn(train.x,test.x,train.mpg01,k=3)
table(knn.pred.auto.3,mpg01.test)
## mpg01.test
## knn.pred.auto.3 0 1
## 0 18 33
## 1 0 63
mean(knn.pred.auto.3 != mpg01.test)
## [1] 0.2894737
knn.pred.auto.5=knn(train.x,test.x,train.mpg01,k=5)
table(knn.pred.auto.5,mpg01.test)
## mpg01.test
## knn.pred.auto.5 0 1
## 0 18 33
## 1 0 63
mean(knn.pred.auto.5 != mpg01.test)
## [1] 0.2894737
knn.pred.auto.7=knn(train.x,test.x,train.mpg01,k=7)
table(knn.pred.auto.7,mpg01.test)
## mpg01.test
## knn.pred.auto.7 0 1
## 0 18 27
## 1 0 69
mean(knn.pred.auto.7 != mpg01.test)
## [1] 0.2368421
For this data set, K = 1 and K = 7 give the lowest test error (about 23.7%), so either value appears to perform best among those tried.
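Rather than checking a handful of values by hand, the test error can be tabulated over a grid of K (a sketch, assuming the train.x, test.x, and train.mpg01 objects defined above are still in the workspace):
set.seed(1)
k.values = c(1, 3, 5, 7, 9, 11, 15)
knn.errors = sapply(k.values, function(k)
  mean(knn(train.x, test.x, train.mpg01, k = k) != mpg01.test))
data.frame(k = k.values, test.error = knn.errors)   # test error for each K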
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various sub-sets of the predictors. Describe your findings.
library(MASS)
attach(Boston)
crim01 = rep(0,length(crim))
crim01[crim > median(crim)] = 1
bos = data.frame(Boston,crim01)
cor(bos)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crim01 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## crim01 -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128
## ptratio black lstat medv crim01
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046 0.40939545
## zn -0.3916785 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252 0.60326017
## chas -0.1215152 0.04878848 -0.0539293 0.1752602 0.07009677
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208 0.72323480
## rm -0.3555015 0.12806864 -0.6138083 0.6953599 -0.15637178
## age 0.2615150 -0.27353398 0.6023385 -0.3769546 0.61393992
## dis -0.2324705 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262 0.61978625
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867 0.25356836
## black -0.1773833 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627 0.45326273
## medv -0.5077867 0.33346082 -0.7376627 1.0000000 -0.26301673
## crim01 0.2535684 -0.35121093 0.4532627 -0.2630167 1.00000000
boxplot(tax~crim01)
boxplot(nox~crim01)
boxplot(age~crim01)
boxplot(dis~crim01)
boxplot(indus~crim01)
boxplot(zn~crim01)
Looking at the correlation matrix, crim01 has a strong positive correlation with indus, nox, age, tax, and rad, and a strong negative correlation with dis; zn also has a fairly strong negative correlation.
The boxplots of several variables split by the two levels of crim01 confirm that there are strong relationships between crim01 and many of the Boston variables. One plot worth noting is zn by crim01: aside from a couple of outliers, suburbs with above-median crime have essentially no residential land zoned for lots over 25,000 sq. ft.
train.index = sample(1:nrow(bos), .8 * nrow(bos))
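# note: no seed is set before sample(), so the exact split (and every accuracy reported below) will vary between runs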
test.index = setdiff(1:nrow(bos),train.index)
train.bos = bos[train.index,]
test.bos = bos[test.index,]
crim01.test = crim01[test.index]
glm.fit.bos = glm(crim01~tax+nox+age+dis+indus+zn, data = train.bos,family = binomial)
summary(glm.fit.bos)
##
## Call:
## glm(formula = crim01 ~ tax + nox + age + dis + indus + zn, family = binomial,
## data = train.bos)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.10546 -0.36781 -0.02329 0.43187 2.63422
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.386228 3.176303 -6.418 1.38e-10 ***
## tax 0.004187 0.001542 2.715 0.00663 **
## nox 34.594481 5.833904 5.930 3.03e-09 ***
## age 0.003116 0.008722 0.357 0.72092
## dis 0.326027 0.169313 1.926 0.05416 .
## indus -0.090324 0.038802 -2.328 0.01992 *
## zn -0.049185 0.023998 -2.050 0.04041 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.02 on 403 degrees of freedom
## Residual deviance: 243.17 on 397 degrees of freedom
## AIC: 257.17
##
## Number of Fisher Scoring iterations: 7
glm.probs.bos = predict(glm.fit.bos, test.bos, type ='response')
length(crim01.test)
## [1] 102
glm.pred.bos = rep(0,102)
glm.pred.bos[glm.probs.bos > 0.5] = 1
length(glm.pred.bos)
## [1] 102
table(glm.pred.bos,crim01.test)
## crim01.test
## glm.pred.bos 0 1
## 0 41 5
## 1 8 48
mean(glm.pred.bos == crim01.test)
## [1] 0.872549
Fitting a logistic regression model for crim01 (crime rate above or below the median) on tax, nox, age, dis, indus, and zn resulted in a correct prediction rate on the test data of about 87.3%. Looking at the summary of the model, tax, nox, indus, and zn appear to be the significant variables (dis is marginal and age is not significant), so a new model is refit below with only those four.
glm.fit.bos.1 = glm(crim01~tax+nox+indus+zn, data = train.bos,family = binomial)
summary(glm.fit.bos.1)
##
## Call:
## glm(formula = crim01 ~ tax + nox + indus + zn, family = binomial,
## data = train.bos)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.08840 -0.35106 -0.04306 0.41263 2.67700
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.454221 2.299192 -7.157 8.27e-13 ***
## tax 0.004087 0.001521 2.688 0.0072 **
## nox 29.895334 4.854787 6.158 7.37e-10 ***
## indus -0.093621 0.038140 -2.455 0.0141 *
## zn -0.035715 0.021202 -1.684 0.0921 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.02 on 403 degrees of freedom
## Residual deviance: 246.93 on 399 degrees of freedom
## AIC: 256.93
##
## Number of Fisher Scoring iterations: 7
glm.probs.bos.1 = predict(glm.fit.bos.1, test.bos, type ='response')
length(crim01.test)
## [1] 102
glm.pred.bos.1 = rep(0,102)
glm.pred.bos.1[glm.probs.bos.1 > 0.5] = 1
length(glm.pred.bos.1)
## [1] 102
table(glm.pred.bos.1,crim01.test)
## crim01.test
## glm.pred.bos.1 0 1
## 0 42 5
## 1 7 48
mean(glm.pred.bos.1 == crim01.test)
## [1] 0.8823529
This reduced model does slightly better, with about 88.2% correct predictions on the test data.
lda.fit.bos = lda(crim01~tax+nox+indus+zn, data = train.bos)
plot(lda.fit.bos)
lda.pred.bos = predict(lda.fit.bos, test.bos)
lda.class.bos = lda.pred.bos$class
table(lda.class.bos,crim01.test)
## crim01.test
## lda.class.bos 0 1
## 0 45 11
## 1 4 42
mean(lda.class.bos == crim01.test)
## [1] 0.8529412
Fitting an LDA model for crim01 with tax, nox, indus, and zn results in a model that correctly predicts about 85.3% of the test observations.
qda.fit.bos=qda(crim01~tax+nox+indus+zn, data=train.bos)
qda.class.bos=predict(qda.fit.bos,test.bos)$class
table(qda.class.bos,crim01.test)
## crim01.test
## qda.class.bos 0 1
## 0 38 4
## 1 11 49
mean(qda.class.bos == crim01.test)
## [1] 0.8529412
A QDA model fit on the same variables gives the same correct prediction rate of about 85.3%.
library(class)
train.x=cbind(tax,nox,indus,zn)[train.index,]
test.x=cbind(tax,nox,indus,zn)[test.index,]
train.crim01=crim01[train.index]
dim(train.x)
## [1] 404 4
dim(test.x)
## [1] 102 4
length(train.crim01)
## [1] 404
set.seed(1)
knn.pred.bos.1=knn(train.x,test.x,train.crim01,k=1)
table(knn.pred.bos.1,crim01.test)
## crim01.test
## knn.pred.bos.1 0 1
## 0 48 3
## 1 1 50
mean(knn.pred.bos.1 == crim01.test)
## [1] 0.9607843
knn.pred.bos.3=knn(train.x,test.x,train.crim01,k=3)
table(knn.pred.bos.3,crim01.test)
## crim01.test
## knn.pred.bos.3 0 1
## 0 46 3
## 1 3 50
mean(knn.pred.bos.3 == crim01.test)
## [1] 0.9411765
knn.pred.bos.5=knn(train.x,test.x,train.crim01,k=5)
table(knn.pred.bos.5,crim01.test)
## crim01.test
## knn.pred.bos.5 0 1
## 0 44 3
## 1 5 50
mean(knn.pred.bos.5 == crim01.test)
## [1] 0.9215686
knn.pred.bos.7=knn(train.x,test.x,train.crim01,k=7)
table(knn.pred.bos.7,crim01.test)
## crim01.test
## knn.pred.bos.7 0 1
## 0 42 3
## 1 7 50
mean(knn.pred.bos.7 == crim01.test)
## [1] 0.9019608
Several KNN models were fit to predict crim01 from tax, nox, indus, and zn. All of them predict very well on the held-out data: K = 1 does best at about 96.1%, followed by K = 3 at about 94.1%, K = 5 at about 92.2%, and K = 7 at about 90.2%.
All of the models fit for this problem had high prediction rates, and the KNN models performed best overall. Although K = 1 gave the highest test accuracy, it is the most flexible KNN choice and the most prone to overfitting, so one of the other KNN models with K = 3 or K = 5 would likely be a more reliable choice.
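One way to support that choice is to look at the held-out accuracy over a wider grid of K values (a sketch reusing the train.x, test.x, and train.crim01 objects above; as with the other fits here, the exact numbers depend on the random train/test split):
set.seed(1)
k.grid = c(1, 3, 5, 7, 9, 11, 15, 25)
knn.acc = sapply(k.grid, function(k)
  mean(knn(train.x, test.x, train.crim01, k = k) == crim01.test))
data.frame(k = k.grid, accuracy = knn.acc)   # held-out accuracy for each K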