Chapter 04 (page 168): 10, 11, 13

#10

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

attach(Weekly)
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260
pairs(Weekly)

cor(Weekly[-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

Based on the scatterplot matrix and the correlations, Volume and Year have a strong positive relationship (correlation of about 0.84): trading volume has increased steadily over time. Today and Direction are also related by construction, since Direction simply records whether the current week's return (Today) is positive or negative. None of the lag variables shows a meaningful correlation with Today.
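
A quick plot of Volume over time makes this trend explicit (a small sketch; Weekly is already attached above):

plot(Year, Volume)   # trading volume rises steadily from 1990 to 2010
plot(Volume)         # same pattern against the week index, as in the chapter's lab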

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

glm.fit.weekly.full = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data = Weekly, family = binomial)
glm.fit.weekly.full
## 
## Call:  glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4  
##     0.26686     -0.04127      0.05844     -0.01606     -0.02779  
##        Lag5       Volume  
##    -0.01447     -0.02274  
## 
## Degrees of Freedom: 1088 Total (i.e. Null);  1082 Residual
## Null Deviance:       1496 
## Residual Deviance: 1486  AIC: 1500
summary(glm.fit.weekly.full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Of the predictors, only Lag2 appears statistically significant, with a p-value of 0.0296.
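
The p-values can also be extracted programmatically from the summary object (a small sketch):

summary(glm.fit.weekly.full)$coefficients[, "Pr(>|z|)"]   # p-value for each coefficient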

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm.probs.weekly.full=predict(glm.fit.weekly.full,type='response')
glm.pred.weekly.full=rep("Down",1089)
glm.pred.weekly.full[glm.probs.weekly.full>0.5] = "Up"
table(glm.pred.weekly.full,Direction)
##                     Direction
## glm.pred.weekly.full Down  Up
##                 Down   54  48
##                 Up    430 557

The above table shows that the model predicts "Up" far more often than "Down". It does well in weeks when the market actually went up, correctly identifying 557 of the 605 "Up" weeks, but very poorly in weeks when the market went down, identifying only 54 of the 484 "Down" weeks. The overall fraction of correct predictions is (54 + 557)/1089, or about 56.1%, only slightly better than always predicting "Up" (605/1089, about 55.6%).
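
These rates can be computed directly from the objects above (a small sketch):

conf = table(glm.pred.weekly.full, Direction)
sum(diag(conf)) / sum(conf)                  # overall fraction correct, (54 + 557)/1089
conf["Up", "Up"] / sum(conf[, "Up"])         # fraction of "Up" weeks predicted correctly
conf["Down", "Down"] / sum(conf[, "Down"])   # fraction of "Down" weeks predicted correctly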

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train=(Year<2009)
test.weekly = Weekly[!train,]
Direction.test.weekly=Direction[!train]
glm.fit.weekly=glm(Direction~Lag2, data=Weekly, subset=train, family = binomial)
glm.probs.weekly=predict(glm.fit.weekly,test.weekly, type ='response')
length(glm.probs.weekly)
## [1] 104
glm.pred.weekly=rep('Down',104)
glm.pred.weekly[glm.probs.weekly>0.5] = 'Up'
table(glm.pred.weekly,Direction.test.weekly)
##                Direction.test.weekly
## glm.pred.weekly Down Up
##            Down    9  5
##            Up     34 56
mean(glm.pred.weekly==Direction.test.weekly)
## [1] 0.625

(e) Repeat (d) using LDA.

library(MASS)
lda.fit.weekly=lda(Direction~Lag2, data=Weekly, subset=train)
lda.fit.weekly
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
plot(lda.fit.weekly)

lda.pred.weekly=predict(lda.fit.weekly, test.weekly)
names(lda.pred.weekly)
## [1] "class"     "posterior" "x"
lda.class.weekly=lda.pred.weekly$class
table(lda.class.weekly,Direction.test.weekly)
##                 Direction.test.weekly
## lda.class.weekly Down Up
##             Down    9  5
##             Up     34 56
mean(lda.class.weekly==Direction.test.weekly)
## [1] 0.625

(f) Repeat (d) using QDA.

qda.fit.weekly=qda(Direction~Lag2, data=Weekly, subset=train)
qda.fit.weekly
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda.class.weekly=predict(qda.fit.weekly,test.weekly)$class
table(qda.class.weekly,Direction.test.weekly)
##                 Direction.test.weekly
## qda.class.weekly Down Up
##             Down    0  0
##             Up     43 61
mean(qda.class.weekly==Direction.test.weekly)
## [1] 0.5865385

(g) Repeat (d) using KNN with K=1.

library(class)
train.x=as.matrix(Lag2[train])
test.x=as.matrix(Lag2[!train])
train.Direction=Direction[train]
dim(train.x)
## [1] 985   1
dim(test.x)
## [1] 104   1
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.x,test.x,train.Direction,k=1)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred==Direction.test.weekly)
## [1] 0.5

(h) Which of these methods appears to provide the best results on this data?
Logistic regression (with Lag2 alone) and LDA tie for the best results, each correctly predicting 62.5% of the held-out weeks. QDA is next at ~58.7%, but only because it predicts "Up" for every test week, and KNN with K = 1 is correct just 50% of the time.
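
A compact side-by-side comparison (a quick sketch using the prediction objects created above; knn.pred is still the K = 1 fit from part (g)):

sapply(list(logistic = glm.pred.weekly, LDA = lda.class.weekly,
            QDA = qda.class.weekly, KNN = knn.pred),
       function(pred) mean(pred == Direction.test.weekly))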

(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

glm.fit.weekly.1 = glm(Direction~Lag2+Year*Volume, data = Weekly, family = binomial)
glm.fit.weekly.1
## 
## Call:  glm(formula = Direction ~ Lag2 + Year * Volume, family = binomial, 
##     data = Weekly)
## 
## Coefficients:
## (Intercept)         Lag2         Year       Volume  Year:Volume  
##    9.476345     0.059935    -0.004581   -29.828896     0.014842  
## 
## Degrees of Freedom: 1088 Total (i.e. Null);  1084 Residual
## Null Deviance:       1496 
## Residual Deviance: 1489  AIC: 1499
summary(glm.fit.weekly.1)
## 
## Call:
## glm(formula = Direction ~ Lag2 + Year * Volume, family = binomial, 
##     data = Weekly)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.517  -1.262   1.004   1.082   1.422  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   9.476345  42.550560   0.223   0.8238  
## Lag2          0.059935   0.026909   2.227   0.0259 *
## Year         -0.004581   0.021359  -0.214   0.8302  
## Volume      -29.828896  43.184789  -0.691   0.4897  
## Year:Volume   0.014842   0.021473   0.691   0.4894  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1489.4  on 1084  degrees of freedom
## AIC: 1499.4
## 
## Number of Fisher Scoring iterations: 4
glm.probs.weekly.1=predict(glm.fit.weekly.1,test.weekly, type ='response')
glm.pred.weekly.1=rep('Down',104)
glm.pred.weekly.1[glm.probs.weekly.1>0.5] = 'Up'
table(glm.pred.weekly.1,Direction.test.weekly)
##                  Direction.test.weekly
## glm.pred.weekly.1 Down Up
##              Down    6  4
##              Up     37 57
mean(glm.pred.weekly.1==Direction.test.weekly)
## [1] 0.6057692
glm.fit.weekly.2 = glm(Direction~Lag1*Lag2, data = Weekly, family = binomial)
glm.fit.weekly.2
## 
## Call:  glm(formula = Direction ~ Lag1 * Lag2, family = binomial, data = Weekly)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2    Lag1:Lag2  
##    0.221393    -0.035783     0.059931     0.002018  
## 
## Degrees of Freedom: 1088 Total (i.e. Null);  1085 Residual
## Null Deviance:       1496 
## Residual Deviance: 1488  AIC: 1496
summary(glm.fit.weekly.2)
## 
## Call:
## glm(formula = Direction ~ Lag1 * Lag2, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.579  -1.263   1.004   1.081   1.572  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.221393   0.061474   3.601 0.000317 ***
## Lag1        -0.035783   0.027992  -1.278 0.201139    
## Lag2         0.059931   0.026649   2.249 0.024521 *  
## Lag1:Lag2    0.002018   0.006729   0.300 0.764207    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1488.1  on 1085  degrees of freedom
## AIC: 1496.1
## 
## Number of Fisher Scoring iterations: 4
glm.probs.weekly.2=predict(glm.fit.weekly.2,test.weekly, type ='response')
glm.pred.weekly.2=rep('Down',104)
glm.pred.weekly.2[glm.probs.weekly.2>0.5] = 'Up'
table(glm.pred.weekly.2,Direction.test.weekly)
##                  Direction.test.weekly
## glm.pred.weekly.2 Down Up
##              Down    5  6
##              Up     38 55
mean(glm.pred.weekly.2==Direction.test.weekly)
## [1] 0.5769231
glm.fit.weekly.3 = glm(Direction~Lag2*Volume, data = Weekly, family = binomial)
glm.fit.weekly.3
## 
## Call:  glm(formula = Direction ~ Lag2 * Volume, family = binomial, data = Weekly)
## 
## Coefficients:
## (Intercept)         Lag2       Volume  Lag2:Volume  
##     0.23724      0.05061     -0.01297      0.00415  
## 
## Degrees of Freedom: 1088 Total (i.e. Null);  1085 Residual
## Null Deviance:       1496 
## Residual Deviance: 1490  AIC: 1498
summary(glm.fit.weekly.3)
## 
## Call:
## glm(formula = Direction ~ Lag2 * Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.589  -1.269   1.015   1.084   1.474  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.23724    0.08418   2.818  0.00483 **
## Lag2         0.05061    0.03911   1.294  0.19560   
## Volume      -0.01297    0.03668  -0.354  0.72363   
## Lag2:Volume  0.00415    0.01068   0.389  0.69764   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1490.1  on 1085  degrees of freedom
## AIC: 1498.1
## 
## Number of Fisher Scoring iterations: 4
glm.probs.weekly.3=predict(glm.fit.weekly.3,test.weekly, type ='response')
glm.pred.weekly.3=rep('Down',104)
glm.pred.weekly.3[glm.probs.weekly.3>0.5] = 'Up'
table(glm.pred.weekly.3,Direction.test.weekly)
##                  Direction.test.weekly
## glm.pred.weekly.3 Down Up
##              Down   10  6
##              Up     33 55
mean(glm.pred.weekly.3==Direction.test.weekly)
## [1] 0.625
library(MASS)
lda.fit.weekly.1=lda(Direction~Lag2+Year*Volume, data=Weekly, subset=train)
plot(lda.fit.weekly.1)

lda.pred.weekly.1=predict(lda.fit.weekly.1, test.weekly)
names(lda.pred.weekly.1)
## [1] "class"     "posterior" "x"
lda.class.weekly.1=lda.pred.weekly.1$class
table(lda.class.weekly.1,Direction.test.weekly)
##                   Direction.test.weekly
## lda.class.weekly.1 Down Up
##               Down   30 42
##               Up     13 19
mean(lda.class.weekly.1==Direction.test.weekly)
## [1] 0.4711538
lda.fit.weekly.2=lda(Direction~Lag1*Lag2, data=Weekly, subset=train)
plot(lda.fit.weekly.2)

lda.pred.weekly.2=predict(lda.fit.weekly.2, test.weekly)
names(lda.pred.weekly.2)
## [1] "class"     "posterior" "x"
lda.class.weekly.2=lda.pred.weekly.2$class
table(lda.class.weekly.2,Direction.test.weekly)
##                   Direction.test.weekly
## lda.class.weekly.2 Down Up
##               Down    7  8
##               Up     36 53
mean(lda.class.weekly.2==Direction.test.weekly)
## [1] 0.5769231
lda.fit.weekly.3=lda(Direction~Lag2*Volume, data=Weekly, subset=train)
plot(lda.fit.weekly.3)

lda.pred.weekly.3=predict(lda.fit.weekly.3, test.weekly)
names(lda.pred.weekly.3)
## [1] "class"     "posterior" "x"
lda.class.weekly.3=lda.pred.weekly.3$class
table(lda.class.weekly.3,Direction.test.weekly)
##                   Direction.test.weekly
## lda.class.weekly.3 Down Up
##               Down   20 25
##               Up     23 36
mean(lda.class.weekly.3==Direction.test.weekly)
## [1] 0.5384615
qda.fit.weekly.1=qda(Direction~Lag2+Year*Volume, data=Weekly, subset=train)
qda.class.weekly.1=predict(qda.fit.weekly.1,test.weekly)$class
table(qda.class.weekly.1,Direction.test.weekly)
##                   Direction.test.weekly
## qda.class.weekly.1 Down Up
##               Down   28 34
##               Up     15 27
mean(qda.class.weekly.1==Direction.test.weekly)
## [1] 0.5288462
qda.fit.weekly.2=qda(Direction~Lag1*Lag2, data=Weekly, subset=train)
qda.class.weekly.2=predict(qda.fit.weekly.2,test.weekly)$class
table(qda.class.weekly.2,Direction.test.weekly)
##                   Direction.test.weekly
## qda.class.weekly.2 Down Up
##               Down   23 36
##               Up     20 25
mean(qda.class.weekly.2==Direction.test.weekly)
## [1] 0.4615385
qda.fit.weekly.3=qda(Direction~Lag2*Volume, data=Weekly, subset=train)
qda.class.weekly.3=predict(qda.fit.weekly.3,test.weekly)$class
table(qda.class.weekly.3,Direction.test.weekly)
##                   Direction.test.weekly
## qda.class.weekly.3 Down Up
##               Down   37 49
##               Up      6 12
mean(qda.class.weekly.3==Direction.test.weekly)
## [1] 0.4711538
library(class)
train.x=cbind(Lag2,Year,Volume)[train,]
test.x=cbind(Lag2,Year,Volume)[!train,]
train.Direction=Direction[train]
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.x,test.x,train.Direction,k=1)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   29 32
##     Up     14 29
mean(knn.pred==Direction.test.weekly)
## [1] 0.5576923
knn.pred=knn(train.x,test.x,train.Direction,k=3)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   33 37
##     Up     10 24
mean(knn.pred==Direction.test.weekly)
## [1] 0.5480769
knn.pred=knn(train.x,test.x,train.Direction,k=5)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   29 36
##     Up     14 25
mean(knn.pred==Direction.test.weekly)
## [1] 0.5192308
train.x=cbind(Lag2,Lag1)[train,]
test.x=cbind(Lag2,Lag1)[!train,]
train.Direction=Direction[train]
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.x,test.x,train.Direction,k=1)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   18 29
##     Up     25 32
mean(knn.pred==Direction.test.weekly)
## [1] 0.4807692
knn.pred=knn(train.x,test.x,train.Direction,k=3)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   22 29
##     Up     21 32
mean(knn.pred==Direction.test.weekly)
## [1] 0.5192308
knn.pred=knn(train.x,test.x,train.Direction,k=5)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   22 32
##     Up     21 29
mean(knn.pred==Direction.test.weekly)
## [1] 0.4903846
train.x=cbind(Lag2,Volume)[train,]
test.x=cbind(Lag2,Volume)[!train,]
train.Direction=Direction[train]
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.x,test.x,train.Direction,k=1)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   26 29
##     Up     17 32
mean(knn.pred==Direction.test.weekly)
## [1] 0.5576923
knn.pred=knn(train.x,test.x,train.Direction,k=3)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   30 34
##     Up     13 27
mean(knn.pred==Direction.test.weekly)
## [1] 0.5480769
knn.pred=knn(train.x,test.x,train.Direction,k=5)
table(knn.pred,Direction.test.weekly)
##         Direction.test.weekly
## knn.pred Down Up
##     Down   28 34
##     Up     15 27
mean(knn.pred==Direction.test.weekly)
## [1] 0.5288462

Of the additional models fit above, the logistic regression of Direction on Lag2, Volume, and the Lag2:Volume interaction gives the highest percentage of correct predictions on the held-out data, 62.5%. This only ties the Lag2-only logistic regression and LDA models from parts (d) and (e), so the extra terms do not improve out-of-sample performance.
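
One caveat: the three logistic models in this part were fit on the full Weekly data set (including 2009-2010), so their test accuracies are slightly optimistic compared with the models in (d)-(g), which were fit on 1990-2008 only. A cleaner comparison would refit on the training years, for example (a sketch, not run here):

glm.fit.weekly.3.train = glm(Direction~Lag2*Volume, data = Weekly, subset = train, family = binomial)
glm.probs.3.train = predict(glm.fit.weekly.3.train, test.weekly, type = 'response')
mean(ifelse(glm.probs.3.train > 0.5, 'Up', 'Down') == Direction.test.weekly)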

#11

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

attach(Auto)

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

mpg01 = rep(0,length(mpg))
mpg01[mpg > median(mpg)] = 1
auto = data.frame(Auto,mpg01)
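
A quick sanity check (a small sketch) confirms that the classes are roughly balanced and that mpg01 was added to the new data frame:

table(auto$mpg01)   # counts of 0s and 1s should be roughly equal
names(auto)         # mpg01 appears as the last column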

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

cor(auto[, -9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000
pairs(auto)

boxplot(cylinders~mpg01, main = 'Cylinders by mpg01')

boxplot(displacement~mpg01, main = 'Displacement by mpg01')

boxplot(horsepower~mpg01, main = 'Horsepower by mpg01')

boxplot(weight~mpg01, main = 'Weight by mpg01')

boxplot(acceleration~mpg01, main = 'Acceleration by mpg01')

boxplot(year~mpg01, main = 'Year by mpg01')

Looking at the correlation matrix for Auto with mpg01 included, mpg01 is strongly correlated with several of the other variables: it has a strong negative correlation with the number of cylinders, engine displacement, horsepower, and vehicle weight, and a moderate positive correlation with acceleration and year - the newer the car, the more likely it is to have gas mileage above the median for the entire data set.

Since mpg01 is a binary variable, the scatterplots against it do not show a trend line. However, for horsepower, acceleration, and weight there is a clear separation between the points with mpg01 = 1 (above-median mpg) and those with mpg01 = 0 (below-median mpg).

The boxplot of cylinders by mpg01 shows that cars with more cylinders generally get worse gas mileage; most vehicles with above-median mpg have 4 cylinders. Similarly, cars with above-median mpg tend to have smaller engines (displacement), less powerful engines (horsepower), and lower weight than cars with below-median mpg. As previously mentioned, higher-mpg cars also appear to have better acceleration - possibly because they are lighter - and newer cars are more likely to have above-median mpg.

(c) Split the data into a training set and a test set.

train.auto=(year<79)
test.auto = auto[!train.auto,]
mpg01.test = mpg01[!train.auto]

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

library(MASS)
lda.fit.auto = lda(mpg01~cylinders+weight+displacement+horsepower+year, data = auto, subset = train.auto)
plot(lda.fit.auto)

lda.pred.auto = predict(lda.fit.auto, test.auto)
lda.class.auto = lda.pred.auto$class
table(lda.class.auto,mpg01.test)
##               mpg01.test
## lda.class.auto  0  1
##              0 17 15
##              1  1 81
mean(lda.class.auto != mpg01.test)
## [1] 0.1403509

Fitting an LDA model for mpg01 on cylinders, weight, displacement, horsepower, and year results in a test error of only ~14%.

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

qda.fit.auto=qda(mpg01~cylinders+weight+displacement+horsepower+year, data=auto, subset=train.auto)
qda.class.auto=predict(qda.fit.auto,test.auto)$class
table(qda.class.auto,mpg01.test)
##               mpg01.test
## qda.class.auto  0  1
##              0 17 19
##              1  1 77
mean(qda.class.auto != mpg01.test)
## [1] 0.1754386

Fitting a QDA for mpg01 with cylinders, weight, displacement, horsepower, and year results in a test error of ~17.5%.

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

glm.fit.auto = glm(mpg01~cylinders+weight+displacement+horsepower+year, data=auto, subset=train.auto, family = binomial)
summary(glm.fit.auto)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower + 
##     year, family = binomial, data = auto, subset = train.auto)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4289  -0.1162  -0.0028   0.1674   2.2970  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.615923   7.761226  -0.079   0.9367   
## cylinders    -0.465777   0.615851  -0.756   0.4495   
## weight       -0.003961   0.001325  -2.989   0.0028 **
## displacement -0.005918   0.014769  -0.401   0.6886   
## horsepower   -0.043138   0.023610  -1.827   0.0677 . 
## year          0.241357   0.111657   2.162   0.0306 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 363.21  on 277  degrees of freedom
## Residual deviance: 101.47  on 272  degrees of freedom
## AIC: 113.47
## 
## Number of Fisher Scoring iterations: 8
glm.probs.auto = predict(glm.fit.auto,test.auto, type ='response')
glm.pred.auto = rep(0,114)
glm.pred.auto[glm.probs.auto > 0.5] = 1
table(glm.pred.auto,mpg01.test)
##              mpg01.test
## glm.pred.auto  0  1
##             0 17 14
##             1  1 82
mean(glm.pred.auto != mpg01.test)
## [1] 0.1315789

Fitting a logistic regression model on cylinders, weight, displacement, horsepower, and year results in a test error of ~13.2%. However, looking at the p-values of the coefficients, only weight and year appear to be significant, so the model is refit below with just these two variables.

glm.fit.auto = glm(mpg01~weight+year, data=auto, subset=train.auto, family = binomial)
summary(glm.fit.auto)
## 
## Call:
## glm(formula = mpg01 ~ weight + year, family = binomial, data = auto, 
##     subset = train.auto)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.37472  -0.14004  -0.00743   0.20230   2.12167  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.2826234  7.2101288  -0.871  0.38356    
## weight      -0.0062790  0.0008974  -6.997 2.62e-12 ***
## year         0.3065403  0.1086480   2.821  0.00478 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 363.21  on 277  degrees of freedom
## Residual deviance: 107.78  on 275  degrees of freedom
## AIC: 113.78
## 
## Number of Fisher Scoring iterations: 8
glm.probs.auto = predict(glm.fit.auto,test.auto, type ='response')
glm.pred.auto = rep(0,114)
glm.pred.auto[glm.probs.auto > 0.5] = 1
table(glm.pred.auto,mpg01.test)
##              mpg01.test
## glm.pred.auto  0  1
##             0 17 12
##             1  1 84
mean(glm.pred.auto != mpg01.test)
## [1] 0.1140351

After refitting the model on only weight and year, the test error is now ~11.4%. Additionally, both weight and year are now significant.

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

library(class)
train.x=cbind(weight,year)[train.auto,]
test.x=cbind(weight,year)[!train.auto,]
train.mpg01=mpg01[train.auto]
dim(train.x)
## [1] 278   2
dim(test.x)
## [1] 114   2
length(train.mpg01)
## [1] 278
set.seed(1)
knn.pred.auto.1=knn(train.x,test.x,train.mpg01,k=1)
table(knn.pred.auto.1,mpg01.test)
##                mpg01.test
## knn.pred.auto.1  0  1
##               0 18 27
##               1  0 69
mean(knn.pred.auto.1 != mpg01.test)
## [1] 0.2368421
knn.pred.auto.3=knn(train.x,test.x,train.mpg01,k=3)
table(knn.pred.auto.3,mpg01.test)
##                mpg01.test
## knn.pred.auto.3  0  1
##               0 18 33
##               1  0 63
mean(knn.pred.auto.3 != mpg01.test)
## [1] 0.2894737
knn.pred.auto.5=knn(train.x,test.x,train.mpg01,k=5)
table(knn.pred.auto.5,mpg01.test)
##                mpg01.test
## knn.pred.auto.5  0  1
##               0 18 33
##               1  0 63
mean(knn.pred.auto.5 != mpg01.test)
## [1] 0.2894737
knn.pred.auto.7=knn(train.x,test.x,train.mpg01,k=7)
table(knn.pred.auto.7,mpg01.test)
##                mpg01.test
## knn.pred.auto.7  0  1
##               0 18 27
##               1  0 69
mean(knn.pred.auto.7 != mpg01.test)
## [1] 0.2368421

For this data set, K = 1 and K = 7 perform best, each with a test error of ~23.7%; K = 3 and K = 5 both give a test error of ~28.9%.
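
Rather than checking a handful of values by hand, one could scan a range of K values in a loop; a minimal sketch using the objects above:

set.seed(1)
knn.err = sapply(1:15, function(k) {
  pred = knn(train.x, test.x, train.mpg01, k = k)
  mean(pred != mpg01.test)               # test error for this value of K
})
plot(1:15, knn.err, type = 'b', xlab = 'K', ylab = 'Test error')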

#13

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

library(MASS)
attach(Boston)

crim01 = rep(0,length(crim))
crim01[crim > median(crim)] = 1
bos = data.frame(Boston,crim01)
cor(bos)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## crim01   0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
## crim01  -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128
##            ptratio       black      lstat       medv      crim01
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046  0.40939545
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453 -0.43615103
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252  0.60326017
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602  0.07009677
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208  0.72323480
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599 -0.15637178
## age      0.2615150 -0.27353398  0.6023385 -0.3769546  0.61393992
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287 -0.61634164
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262  0.61978625
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867  0.25356836
## black   -0.1773833  1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627  0.45326273
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000 -0.26301673
## crim01   0.2535684 -0.35121093  0.4532627 -0.2630167  1.00000000
boxplot(tax~crim01)

boxplot(nox~crim01)

boxplot(age~crim01)

boxplot(dis~crim01)

boxplot(indus~crim01)

boxplot(zn~crim01)

Looking at the correlation matrix, crim01 has a strong positive correlation with indus, nox, age, tax, and rad, and a strong negative correlation with dis; zn also has a fairly strong negative correlation (about -0.44).

Looking at the boxplots of several variables across the two levels of crim01, it is clear that there are strong relationships between crim01 and many of the variables in the Boston data set. The plot of zn by crim01 is worth noting: apart from two outliers, suburbs with above-median crime have essentially no land zoned for lots over 25,000 sq. ft.

train.index = sample(1:nrow(bos), .8 * nrow(bos))
test.index = setdiff(1:nrow(bos),train.index)

train.bos = bos[train.index,]
test.bos = bos[test.index,]
crim01.test = crim01[test.index]
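
Note that the split above is random and unseeded, so the exact counts and error rates below will change from run to run. For a reproducible split, the generator can be seeded first (a sketch; the seed value 1 is an arbitrary choice):

set.seed(1)
train.index = sample(1:nrow(bos), .8 * nrow(bos))
test.index = setdiff(1:nrow(bos), train.index)
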
glm.fit.bos = glm(crim01~tax+nox+age+dis+indus+zn, data = train.bos,family = binomial)
summary(glm.fit.bos)
## 
## Call:
## glm(formula = crim01 ~ tax + nox + age + dis + indus + zn, family = binomial, 
##     data = train.bos)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.10546  -0.36781  -0.02329   0.43187   2.63422  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.386228   3.176303  -6.418 1.38e-10 ***
## tax           0.004187   0.001542   2.715  0.00663 ** 
## nox          34.594481   5.833904   5.930 3.03e-09 ***
## age           0.003116   0.008722   0.357  0.72092    
## dis           0.326027   0.169313   1.926  0.05416 .  
## indus        -0.090324   0.038802  -2.328  0.01992 *  
## zn           -0.049185   0.023998  -2.050  0.04041 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 560.02  on 403  degrees of freedom
## Residual deviance: 243.17  on 397  degrees of freedom
## AIC: 257.17
## 
## Number of Fisher Scoring iterations: 7
glm.probs.bos = predict(glm.fit.bos, test.bos, type ='response')
length(crim01.test)
## [1] 102
glm.pred.bos = rep(0,102)
glm.pred.bos[glm.probs.bos > 0.5] = 1
length(glm.pred.bos)
## [1] 102
table(glm.pred.bos,crim01.test)
##             crim01.test
## glm.pred.bos  0  1
##            0 41  5
##            1  8 48
mean(glm.pred.bos == crim01.test)
## [1] 0.872549

Fitting a model for crim01 (crime rate above or below the median) on tax, nox, age, dis, indus, and zn results in a model with a correct prediction rate on the test data of ~87.3%. Looking at the summary of the model, we can see that tax, nox, indus, and zn are potentially significant variables, so a new model is refit below with only these variables.

glm.fit.bos.1 = glm(crim01~tax+nox+indus+zn, data = train.bos,family = binomial)
summary(glm.fit.bos.1)
## 
## Call:
## glm(formula = crim01 ~ tax + nox + indus + zn, family = binomial, 
##     data = train.bos)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.08840  -0.35106  -0.04306   0.41263   2.67700  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -16.454221   2.299192  -7.157 8.27e-13 ***
## tax           0.004087   0.001521   2.688   0.0072 ** 
## nox          29.895334   4.854787   6.158 7.37e-10 ***
## indus        -0.093621   0.038140  -2.455   0.0141 *  
## zn           -0.035715   0.021202  -1.684   0.0921 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 560.02  on 403  degrees of freedom
## Residual deviance: 246.93  on 399  degrees of freedom
## AIC: 256.93
## 
## Number of Fisher Scoring iterations: 7
glm.probs.bos.1 = predict(glm.fit.bos.1, test.bos, type ='response')
length(crim01.test)
## [1] 102
glm.pred.bos.1 = rep(0,102)
glm.pred.bos.1[glm.probs.bos.1 > 0.5] = 1
length(glm.pred.bos.1)
## [1] 102
table(glm.pred.bos.1,crim01.test)
##               crim01.test
## glm.pred.bos.1  0  1
##              0 42  5
##              1  7 48
mean(glm.pred.bos.1 == crim01.test)
## [1] 0.8823529

This reduced model still gives ~88.2% correct predictions on the test set.

lda.fit.bos = lda(crim01~tax+nox+indus+zn, data = train.bos)
plot(lda.fit.bos)

lda.pred.bos = predict(lda.fit.bos, test.bos)
lda.class.bos = lda.pred.bos$class
table(lda.class.bos,crim01.test)
##              crim01.test
## lda.class.bos  0  1
##             0 45 11
##             1  4 42
mean(lda.class.bos == crim01.test)
## [1] 0.8529412

Fitting an LDA model for crim01 with tax, nox, indus, and zn results in a model that correctly predicts ~85.3% of the test values.

qda.fit.bos=qda(crim01~tax+nox+indus+zn, data=train.bos)
qda.class.bos=predict(qda.fit.bos,test.bos)$class
table(qda.class.bos,crim01.test)
##              crim01.test
## qda.class.bos  0  1
##             0 38  4
##             1 11 49
mean(qda.class.bos == crim01.test)
## [1] 0.8529412

A QDA model fit on the same variables gives the same overall correct prediction rate, ~85.3%, although it makes different errors: it misclassifies more below-median suburbs and fewer above-median ones than LDA.

library(class)
train.x=cbind(tax,nox,indus,zn)[train.index,]
test.x=cbind(tax,nox,indus,zn)[test.index,]
train.crim01=crim01[train.index]
dim(train.x)
## [1] 404   4
dim(test.x)
## [1] 102   4
length(train.crim01)
## [1] 404
set.seed(1)
knn.pred.bos.1=knn(train.x,test.x,train.crim01,k=1)
table(knn.pred.bos.1,crim01.test)
##               crim01.test
## knn.pred.bos.1  0  1
##              0 48  3
##              1  1 50
mean(knn.pred.bos.1 == crim01.test)
## [1] 0.9607843
knn.pred.bos.3=knn(train.x,test.x,train.crim01,k=3)
table(knn.pred.bos.3,crim01.test)
##               crim01.test
## knn.pred.bos.3  0  1
##              0 46  3
##              1  3 50
mean(knn.pred.bos.3 == crim01.test)
## [1] 0.9411765
knn.pred.bos.5=knn(train.x,test.x,train.crim01,k=5)
table(knn.pred.bos.5,crim01.test)
##               crim01.test
## knn.pred.bos.5  0  1
##              0 44  3
##              1  5 50
mean(knn.pred.bos.5 == crim01.test)
## [1] 0.9215686
knn.pred.bos.7=knn(train.x,test.x,train.crim01,k=7)
table(knn.pred.bos.7,crim01.test)
##               crim01.test
## knn.pred.bos.7  0  1
##              0 42  3
##              1  7 50
mean(knn.pred.bos.7 == crim01.test)
## [1] 0.9019608

Several KNN models were fit to predict crim01 from tax, nox, indus, and zn. All of them have high correct prediction rates: ~96.1% for K = 1, ~94.1% for K = 3, ~92.2% for K = 5, and ~90.2% for K = 7.

All of the fitted models have high prediction rates on this split. Of the model types considered, KNN performs the best. Although K = 1 gives the highest prediction rate, one of the other KNN models with K = 3 or K = 5 would likely be a safer choice, since K = 1 fits a highly flexible decision boundary that tends to overfit the training data.
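
Because KNN is distance-based and tax, nox, indus, and zn are on very different scales, the tax variable dominates the Euclidean distances above. Standardizing the predictors, as the chapter's lab does for the Caravan data, might be worth trying; a minimal sketch reusing the train/test indices above:

std.x = scale(cbind(tax, nox, indus, zn))   # center and scale each predictor
set.seed(1)
knn.pred.std = knn(std.x[train.index,], std.x[test.index,], train.crim01, k = 3)
mean(knn.pred.std == crim01.test)           # test accuracy with standardized predictors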