Exercises

Applied

10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR)
library(MASS)
library(class)

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
attach(Weekly)
pairs(Weekly)

cor(Weekly[, 1:8])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

The correlations between the lag variables and this week's return (Today) are weak. The only substantial correlation is between Year and Volume (about 0.84). Plotting the data shows that Volume increases over time: the average number of shares traded each week grew from 1990 to 2010.

plot(Weekly$Volume)
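
To make the trend concrete, we can also average Volume within each year (a small supplementary sketch; vol_by_year is just an illustrative name):

vol_by_year <- aggregate(Volume ~ Year, data = Weekly, FUN = mean)
plot(vol_by_year$Year, vol_by_year$Volume, type = "b",
     xlab = "Year", ylab = "Mean weekly volume")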

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

weekly_glm <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(weekly_glm)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

The only statistically significant predictor is Lag2, with a \(p\)-value below \(\alpha = 0.05\). Its positive coefficient suggests that if the market had a positive return two weeks ago, it is slightly more likely to go up this week.

coef(weekly_glm)
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
##  0.26686414 -0.04126894  0.05844168 -0.01606114 -0.02779021 -0.01447206 
##      Volume 
## -0.02274153
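
On the odds scale, the Lag2 estimate corresponds to an odds ratio of \(e^{0.0584} \approx 1.06\), i.e. a one-unit increase in Lag2 raises the odds of an Up week by about 6%. A quick check from the fitted model above:

exp(coef(weekly_glm)["Lag2"])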

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

weekly_probs <- predict(weekly_glm,type="response")
weekly_probs[1:10]
##         1         2         3         4         5         6         7         8 
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972 
##         9        10 
## 0.5715200 0.5554287
contrasts(Weekly$Direction)
##      Up
## Down  0
## Up    1
weekly_pred <- rep("Down", 1089)
weekly_pred[weekly_probs > .5] = "Up"
table(weekly_pred, Weekly$Direction)
##            
## weekly_pred Down  Up
##        Down   54  48
##        Up    430 557
accuracy <- (54+557)/(54+48+430+557)
accuracy
## [1] 0.5610652
mean(weekly_pred == Weekly$Direction)
## [1] 0.5610652

The confusion matrix indicates that the model correctly predicted \(54 + 557 = 611\) of the 1,089 observations, an overall accuracy of 56.11%. It also shows that the mistakes are one-sided: the model predicts Up for most weeks, so it captures 557 of the 605 weeks that actually went up but only 54 of the 484 weeks that went down. Moreover, this is a training accuracy, since the same data were used to fit the model and to make the predictions; to assess the model fairly we need to train it on one part of the data and evaluate it on a held-out part.
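
The asymmetry can be quantified directly from the counts in the table above:

557 / (48 + 557)   # fraction of actual Up weeks predicted correctly, about 0.92
54 / (54 + 430)    # fraction of actual Down weeks predicted correctly, about 0.11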

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train <- (Year < 2009)
Weekly.2010 <- Weekly[!train,]
Direction.2010 <- Direction[!train]
dim(Weekly.2010)
## [1] 104   9
weekly_train <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
summary(weekly_train)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
weeklyProb <- predict(weekly_train, Weekly.2010, type = "response" )
weeklyPred <- rep("Down", 104)
weeklyPred[weeklyProb > .5] = "Up"
table(weeklyPred, Direction.2010)
##           Direction.2010
## weeklyPred Down Up
##       Down    9  5
##       Up     34 56
mean(weeklyPred == Direction.2010)
## [1] 0.625
mean(weeklyPred != Direction.2010)
## [1] 0.375
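
Since the same predict / threshold / table / mean pattern is repeated for every model below, a small helper can save some typing (a sketch; confusion_summary is a hypothetical name, not part of the original solution):

confusion_summary <- function(pred, truth) {
  list(table = table(Predicted = pred, Actual = truth),
       accuracy = mean(pred == truth))
}
confusion_summary(weeklyPred, Direction.2010)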

(e) Repeat (d) using LDA.

weekly.lda <- lda(Direction ~ Lag2, data = Weekly, subset = train)
weekly.lda
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
plot(weekly.lda)

lda.pred <- predict(weekly.lda, Weekly.2010)
names(lda.pred)
## [1] "class"     "posterior" "x"
lda.class=lda.pred$class
table(lda.class ,Direction.2010)
##          Direction.2010
## lda.class Down Up
##      Down    9  5
##      Up     34 56
mean(lda.class == Direction.2010)
## [1] 0.625
sum(lda.pred$posterior[,1]>=.5)
## [1] 14
sum(lda.pred$posterior[,1]<.5)
## [1] 90
lda.pred$posterior[1:20,1]
##       986       987       988       989       990       991       992       993 
## 0.4736555 0.3558617 0.5132860 0.5142948 0.4799727 0.4597586 0.3771117 0.5184724 
##       994       995       996       997       998       999      1000      1001 
## 0.5480397 0.5146118 0.5504246 0.3055404 0.4268160 0.3637275 0.4034316 0.4256310 
##      1002      1003      1004      1005 
## 0.4277053 0.4548626 0.4308002 0.3674066
lda.class[1:20]
##  [1] Up   Up   Down Down Up   Up   Up   Down Down Down Down Up   Up   Up   Up  
## [16] Up   Up   Up   Up   Up  
## Levels: Down Up
sum(lda.pred$posterior[,1]>.9)
## [1] 0

(f) Repeat (d) using QDA.

weekly.qda <- qda(Direction ~ Lag2, data = Weekly, subset = train)
weekly.qda 
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda.pred <- predict(weekly.qda, Weekly.2010)
names(qda.pred)
## [1] "class"     "posterior"
qda.class=qda.pred$class
table(qda.class ,Direction.2010)
##          Direction.2010
## qda.class Down Up
##      Down    0  0
##      Up     43 61
mean(qda.class == Direction.2010)
## [1] 0.5865385
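
Note that the confusion matrix shows QDA predicting Up for every one of the 104 held-out weeks, so its 58.7% accuracy is simply the proportion of Up weeks in 2009-2010. A quick check:

table(qda.class)               # every prediction is "Up"
mean(Direction.2010 == "Up")   # base rate of Up weeks, 61/104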

(g) Repeat (d) using KNN with \(K = 1\).

train1 <- Weekly[Weekly$Year <= 2008, ]
test1 <- Weekly[Weekly$Year > 2008, ]
train.X <- data.frame(train1$Lag2)
test.X <- data.frame(test1$Lag2)
train.Direction <- train1$Direction
set.seed(1)
knn.pred <- knn(train.X,test.X,train.Direction , k=1)
table(knn.pred ,Direction.2010)
##         Direction.2010
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred == Direction.2010)
## [1] 0.5

(h) Which of these methods appears to provide the best results on this data?

Logistic regression and LDA give the same test accuracy of 0.625, so they are better models on this data than QDA (0.587) and KNN with \(K = 1\) (0.500).
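
For reference, the four test accuracies can be collected in one place from the predictions already computed above:

c(Logistic = mean(weeklyPred == Direction.2010),
  LDA      = mean(lda.class == Direction.2010),
  QDA      = mean(qda.class == Direction.2010),
  KNN_k1   = mean(knn.pred == Direction.2010))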

(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

weekly2 <- Weekly[-c(8)]   # drop Today (column 8), since it determines Direction
GLM Model
weekly_glm1 <- glm(Direction ~ .*., data = weekly2, family = binomial, subset = train)
summary(weekly_glm1)
## 
## Call:
## glm(formula = Direction ~ . * ., family = binomial, data = weekly2, 
##     subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6744  -1.2402   0.9261   1.0812   1.5783  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  3.934e+01  6.232e+01   0.631   0.5278  
## Year        -1.958e-02  3.132e-02  -0.625   0.5318  
## Lag1        -3.735e+01  2.248e+01  -1.661   0.0966 .
## Lag2        -1.923e+01  2.140e+01  -0.899   0.3689  
## Lag3         3.740e+00  2.142e+01   0.175   0.8614  
## Lag4        -2.279e+00  2.109e+01  -0.108   0.9140  
## Lag5        -9.274e+00  2.141e+01  -0.433   0.6649  
## Volume       1.588e+01  7.710e+01   0.206   0.8368  
## Year:Lag1    1.870e-02  1.126e-02   1.660   0.0968 .
## Year:Lag2    9.655e-03  1.072e-02   0.901   0.3678  
## Year:Lag3   -1.891e-03  1.073e-02  -0.176   0.8602  
## Year:Lag4    1.118e-03  1.056e-02   0.106   0.9158  
## Year:Lag5    4.611e-03  1.073e-02   0.430   0.6673  
## Year:Volume -7.909e-03  3.835e-02  -0.206   0.8366  
## Lag1:Lag2   -6.246e-03  1.134e-02  -0.551   0.5818  
## Lag1:Lag3    8.394e-03  1.264e-02   0.664   0.5068  
## Lag1:Lag4   -8.752e-03  1.062e-02  -0.824   0.4098  
## Lag1:Lag5   -1.613e-02  1.233e-02  -1.307   0.1911  
## Lag1:Volume -7.181e-02  3.812e-02  -1.884   0.0596 .
## Lag2:Lag3    4.234e-03  1.087e-02   0.389   0.6969  
## Lag2:Lag4    6.066e-04  1.237e-02   0.049   0.9609  
## Lag2:Lag5   -6.461e-03  1.084e-02  -0.596   0.5512  
## Lag2:Volume -1.005e-02  3.401e-02  -0.295   0.7677  
## Lag3:Lag4    1.721e-02  1.044e-02   1.649   0.0992 .
## Lag3:Lag5   -1.199e-02  1.212e-02  -0.989   0.3227  
## Lag3:Volume  2.267e-02  3.652e-02   0.621   0.5348  
## Lag4:Lag5   -3.819e-03  1.090e-02  -0.350   0.7260  
## Lag4:Volume  6.133e-03  3.449e-02   0.178   0.8589  
## Lag5:Volume  1.381e-02  3.743e-02   0.369   0.7120  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1327.2  on 956  degrees of freedom
## AIC: 1385.2
## 
## Number of Fisher Scoring iterations: 5
weekly_probs2 <- predict(weekly_glm1,Weekly.2010, type="response")
weekly_Pred2 <- rep("Down", 104)
weekly_Pred2[weekly_probs2 > .5] = "Up"
table(weekly_Pred2, Direction.2010)
##             Direction.2010
## weekly_Pred2 Down Up
##         Down   23 25
##         Up     20 36
mean(weekly_Pred2 == Direction.2010)
## [1] 0.5673077
weekly_glm2 <- step(weekly_glm1,
direction="backward",test="Chisq", trace = F)

summary(weekly_glm2)
## 
## Call:
## glm(formula = Direction ~ Year + Lag1 + Lag2 + Lag3 + Lag4 + 
##     Lag5 + Volume + Year:Lag1 + Lag1:Lag5 + Lag1:Volume + Lag3:Lag4, 
##     family = binomial, data = weekly2, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6940  -1.2482   0.9681   1.0859   1.7791  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   3.159567  44.275401   0.071   0.9431  
## Year         -0.001412   0.022200  -0.064   0.9493  
## Lag1        -32.629712  21.173970  -1.541   0.1233  
## Lag2          0.047337   0.030839   1.535   0.1248  
## Lag3         -0.004278   0.031828  -0.134   0.8931  
## Lag4         -0.037207   0.031209  -1.192   0.2332  
## Lag5         -0.039435   0.029950  -1.317   0.1879  
## Volume       -0.092176   0.104161  -0.885   0.3762  
## Year:Lag1     0.016334   0.010608   1.540   0.1236  
## Lag1:Lag5    -0.016050   0.011114  -1.444   0.1487  
## Lag1:Volume  -0.060453   0.033451  -1.807   0.0707 .
## Lag3:Lag4     0.014767   0.008953   1.649   0.0991 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1334.2  on 973  degrees of freedom
## AIC: 1358.2
## 
## Number of Fisher Scoring iterations: 5
weekly_probs3 <- predict(weekly_glm2,Weekly.2010, type="response")
weekly_Pred3 <- rep("Down", 104)
weekly_Pred3[weekly_probs3 > .5] = "Up"
table(weekly_Pred3, Direction.2010)
##             Direction.2010
## weekly_Pred3 Down Up
##         Down   28 42
##         Up     15 19
mean(weekly_Pred3 == Direction.2010)
## [1] 0.4519231

The logistic regression model did not improve with these combinations of predictors and interactions: the full interaction model reaches a test accuracy of only 0.567, and the backward-selected model drops to 0.452.

LDA Model
weekly.lda1 <- lda(Direction ~ .*., data = weekly2, subset = train)
weekly.lda1
## Call:
## lda(Direction ~ . * ., data = weekly2, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##          Year         Lag1        Lag2       Lag3       Lag4       Lag5
## Down 1999.295  0.289444444 -0.03568254 0.17080045 0.15925624 0.21409297
## Up   1998.853 -0.009213235  0.26036581 0.08404044 0.09220956 0.04548897
##        Volume Year:Lag1 Year:Lag2 Year:Lag3 Year:Lag4 Year:Lag5 Year:Volume
## Down 1.266966 578.19599 -72.78371  340.8003  317.5336 426.84087    2538.931
## Up   1.156529 -19.47546 520.28269  167.2162  183.6835  90.64494    2317.141
##       Lag1:Lag2    Lag1:Lag3  Lag1:Lag4  Lag1:Lag5 Lag1:Volume  Lag2:Lag3
## Down -0.8014495 -0.006554329 -0.4724269  0.3615693   0.3362336 -0.1937158
## Up   -0.1393632  0.687742739 -0.2036205 -0.1833118  -0.6187543 -0.6405132
##       Lag2:Lag4  Lag2:Lag5 Lag2:Volume    Lag3:Lag4 Lag3:Lag5 Lag3:Volume
## Down 0.78824608 -0.3132494 -0.70124208 -0.991496916 0.4913572  -0.1417243
## Up   0.04407141 -0.3497535  0.06257301  0.007162822 0.3137873  -0.2956207
##       Lag4:Lag5 Lag4:Volume Lag5:Volume
## Down -0.3686248  -0.2257271  -0.2323645
## Up   -0.4945795  -0.1677342  -0.1401411
## 
## Coefficients of linear discriminants:
##                       LD1
## Year        -6.050854e-02
## Lag1        -8.225884e+01
## Lag2        -5.381683e+01
## Lag3         1.540054e+01
## Lag4        -6.628949e-01
## Lag5        -3.280721e+01
## Volume       4.457784e+01
## Year:Lag1    4.115458e-02
## Year:Lag2    2.702039e-02
## Year:Lag3   -7.766719e-03
## Year:Lag4    2.585077e-04
## Year:Lag5    1.632926e-02
## Year:Volume -2.219095e-02
## Lag1:Lag2   -2.686924e-02
## Lag1:Lag3    1.823994e-02
## Lag1:Lag4   -2.502826e-02
## Lag1:Lag5   -4.121845e-02
## Lag1:Volume -1.413024e-01
## Lag2:Lag3    1.314428e-02
## Lag2:Lag4   -3.126230e-03
## Lag2:Lag5   -1.322231e-02
## Lag2:Volume -2.089106e-02
## Lag3:Lag4    4.962695e-02
## Lag3:Lag5   -3.042063e-02
## Lag3:Volume  7.665918e-02
## Lag4:Lag5   -1.219883e-02
## Lag4:Volume  3.193697e-02
## Lag5:Volume  3.182095e-02
lda.pred1 <- predict(weekly.lda1, Weekly.2010)
lda.class1=lda.pred1$class
table(lda.class1 ,Direction.2010)
##           Direction.2010
## lda.class1 Down Up
##       Down   22 23
##       Up     21 38
mean(lda.class1 == Direction.2010)
## [1] 0.5769231

The LDA model did not improve with these combinations of predictors and interactions either (test accuracy 0.577).

QDA Model
weekly.qda1 <- qda(Direction ~ .*., data = weekly2, subset = train)
weekly.qda1 
## Call:
## qda(Direction ~ . * ., data = weekly2, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##          Year         Lag1        Lag2       Lag3       Lag4       Lag5
## Down 1999.295  0.289444444 -0.03568254 0.17080045 0.15925624 0.21409297
## Up   1998.853 -0.009213235  0.26036581 0.08404044 0.09220956 0.04548897
##        Volume Year:Lag1 Year:Lag2 Year:Lag3 Year:Lag4 Year:Lag5 Year:Volume
## Down 1.266966 578.19599 -72.78371  340.8003  317.5336 426.84087    2538.931
## Up   1.156529 -19.47546 520.28269  167.2162  183.6835  90.64494    2317.141
##       Lag1:Lag2    Lag1:Lag3  Lag1:Lag4  Lag1:Lag5 Lag1:Volume  Lag2:Lag3
## Down -0.8014495 -0.006554329 -0.4724269  0.3615693   0.3362336 -0.1937158
## Up   -0.1393632  0.687742739 -0.2036205 -0.1833118  -0.6187543 -0.6405132
##       Lag2:Lag4  Lag2:Lag5 Lag2:Volume    Lag3:Lag4 Lag3:Lag5 Lag3:Volume
## Down 0.78824608 -0.3132494 -0.70124208 -0.991496916 0.4913572  -0.1417243
## Up   0.04407141 -0.3497535  0.06257301  0.007162822 0.3137873  -0.2956207
##       Lag4:Lag5 Lag4:Volume Lag5:Volume
## Down -0.3686248  -0.2257271  -0.2323645
## Up   -0.4945795  -0.1677342  -0.1401411
qda.pred1 <- predict(weekly.qda1, Weekly.2010)
qda.class1=qda.pred1$class
table(qda.class1 ,Direction.2010)
##           Direction.2010
## qda.class1 Down Up
##       Down   28 39
##       Up     15 22
mean(qda.class1 == Direction.2010)
## [1] 0.4807692

The QDA model did not improve with these combinations of predictors and interactions either (test accuracy 0.481).

KNN Model
set.seed(1)
knn.pred2 <- knn(train.X,test.X,train.Direction , k=4)
table(knn.pred2 ,Direction.2010)
##          Direction.2010
## knn.pred2 Down Up
##      Down   20 17
##      Up     23 44
mean(knn.pred2 == Direction.2010)
## [1] 0.6153846
set.seed(1)
knn.pred3 <- knn(train.X,test.X,train.Direction , k=5)
table(knn.pred3 ,Direction.2010)
##          Direction.2010
## knn.pred3 Down Up
##      Down   16 21
##      Up     27 40
mean(knn.pred3 == Direction.2010)
## [1] 0.5384615

For the KNN models I reused the training and test sets from part (g) but varied K. The accuracy improves to 0.615 with \(K = 4\) but falls back to 0.538 at \(K = 5\). Overall, none of these richer models beats the simple Lag2-only logistic regression and LDA fits from parts (d) and (e), which remain the best at a test accuracy of 0.625; among the new fits, KNN with \(K = 4\) (confusion matrix above) comes closest.
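
Instead of trying K values one at a time, a short loop can scan a range (a sketch reusing train.X, test.X and train.Direction from part (g); exact results depend on the seed because knn() breaks ties at random):

set.seed(1)
knn_acc <- sapply(1:10, function(k) {
  mean(knn(train.X, test.X, train.Direction, k = k) == Direction.2010)
})
setNames(round(knn_acc, 3), paste0("K=", 1:10))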

11. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

auto <- Auto
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
auto$mpg01 <- factor(as.numeric(auto$mpg > median(auto$mpg)))
str(auto)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ mpg01       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

library(ggplot2)
auto$name <- NULL
auto$origin <- factor(auto$origin, labels = c("American", "European", "Japanese"))
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : Factor w/ 3 levels "American","European",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ mpg01       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
pairs(auto)

ggplot(auto, aes(x = mpg01, y = cylinders, col = mpg01)) + 
  geom_jitter()  + 
  ggtitle("Cylinders vs mpg01 - Scatterplot") 

ggplot(auto, aes(x = mpg01, y = cylinders, fill = mpg01)) + 
  geom_boxplot()  + 
  ggtitle("Cylinders vs mpg01 - Boxplot")

ggplot(auto, aes(x = mpg01, y = displacement, fill = mpg01)) + 
  geom_boxplot()  + 
  ggtitle("Displacement vs mpg01 - Boxplot") 

ggplot(auto, aes(x = mpg01, y = displacement, col = mpg01)) + 
  geom_jitter()  + 
  ggtitle("Displacement vs mpg01 - Scatterplot")

ggplot(auto, aes(x = mpg01, y = horsepower, fill = mpg01)) + 
  geom_boxplot()  + 
  ggtitle("Horsepower vs mpg01 - Boxplot") 

ggplot(auto, aes(x = mpg01, y = weight, fill = mpg01)) + 
  geom_boxplot()  + 
  ggtitle("Weight vs mpg01 - Boxplot") 

ggplot(auto, aes(x = mpg01, y = acceleration, fill = mpg01)) + 
  geom_boxplot()  + 
  ggtitle("Acceleration vs mpg01 - Boxplot") 

ggplot(auto, aes(x = mpg01, y = year, fill = mpg01)) + 
  geom_boxplot()  + 
  ggtitle("Year vs mpg01 - Boxplot")

ggplot(auto, aes(x = origin, fill = mpg01)) + 
  geom_bar()  + 
  ggtitle("Origin vs mpg01 - Bar")

Because mpg01 is binary, scatterplots against it are hard to read, so the jittered plots and boxplots are more informative. They show that cars with above-median mpg tend to have fewer cylinders, smaller displacement, lower horsepower, and lower weight, with a weaker association for acceleration. Therefore cylinders + displacement + horsepower + weight + acceleration could be used to train the models.
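
As a numeric complement to the plots, the correlation of each quantitative predictor with mpg01 coded as 0/1 (a point-biserial correlation) gives a rough ranking of the associations; based on the boxplots I would expect cylinders, displacement, horsepower and weight to show the strongest (negative) correlations. A sketch (mpg01_num is an illustrative name):

mpg01_num <- as.numeric(as.character(auto$mpg01))
sapply(auto[, c("cylinders", "displacement", "horsepower",
                "weight", "acceleration", "year")],
       function(x) cor(x, mpg01_num))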

(c) Split the data into a training set and a test set.

set.seed(1)
tr_ind <- sample(nrow(auto),0.8*nrow(auto),replace = F)

auto_train <- auto[tr_ind,]
auto_test <- auto[-tr_ind,]
dim(auto_train)
## [1] 313   9
dim(auto_test)
## [1] 79  9

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

auto_lda <- lda(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration, data = auto_train)
(auto_lda)
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight + 
##     acceleration, data = auto_train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4920128 0.5079872 
## 
## Group means:
##   cylinders displacement horsepower   weight acceleration
## 0  6.753247     269.6558   128.0844 3598.026     14.76104
## 1  4.207547     117.5723    79.8239 2350.283     16.49057
## 
## Coefficients of linear discriminants:
##                        LD1
## cylinders    -0.3683711581
## displacement -0.0034011244
## horsepower    0.0043408694
## weight       -0.0009441519
## acceleration  0.0002113197
auto_lda.pred <- predict(auto_lda, newdata = auto_test)
auto_lda.class=auto_lda.pred$class
table(auto_lda.class ,auto_test$mpg01)
##               
## auto_lda.class  0  1
##              0 35  0
##              1  7 37
mean(auto_lda.class != auto_test$mpg01)
## [1] 0.08860759
mean(auto_lda.class == auto_test$mpg01)
## [1] 0.9113924

The test error of the model obtained is 0.0886 (accuracy 0.9114).

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

auto_qda <- qda(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration, data = auto_train)
(auto_qda)
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight + 
##     acceleration, data = auto_train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4920128 0.5079872 
## 
## Group means:
##   cylinders displacement horsepower   weight acceleration
## 0  6.753247     269.6558   128.0844 3598.026     14.76104
## 1  4.207547     117.5723    79.8239 2350.283     16.49057
auto_qda.pred <- predict(auto_qda, newdata = auto_test)
auto_qda.class <- auto_qda.pred$class
table(auto_qda.class ,auto_test$mpg01)
##               
## auto_qda.class  0  1
##              0 37  2
##              1  5 35
mean(auto_qda.class != auto_test$mpg01)
## [1] 0.08860759

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

auto_glm <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration, data = auto_train, family = binomial)
summary(auto_glm)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower + 
##     weight + acceleration, family = binomial, data = auto_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4243  -0.2259   0.1234   0.4160   3.2483  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  11.8893784  2.9794937   3.990  6.6e-05 ***
## cylinders     0.0527884  0.3685127   0.143   0.8861    
## displacement -0.0114217  0.0087598  -1.304   0.1923    
## horsepower   -0.0403983  0.0224690  -1.798   0.0722 .  
## weight       -0.0020999  0.0009563  -2.196   0.0281 *  
## acceleration -0.0198422  0.1349911  -0.147   0.8831    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 433.83  on 312  degrees of freedom
## Residual deviance: 175.11  on 307  degrees of freedom
## AIC: 187.11
## 
## Number of Fisher Scoring iterations: 7
auto_glm.prob <- predict(auto_glm, newdata = auto_test, type="response")
auto_glm.pred <- rep("0", 79)
auto_glm.pred[auto_glm.prob > .5] = "1"
table(auto_glm.pred, auto_test$mpg01)
##              
## auto_glm.pred  0  1
##             0 38  1
##             1  4 36
mean(auto_glm.pred != auto_test$mpg01)
## [1] 0.06329114

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

auto_train.X = data.frame(auto_train$cylinders, auto_train$displacement, auto_train$horsepower, auto_train$weight, auto_train$acceleration)
auto_test.X = data.frame(auto_test$cylinders, auto_test$displacement, auto_test$horsepower, auto_test$weight, auto_test$acceleration)
train.mpg01 = auto_train$mpg01
set.seed(1)
auto_knn.pred = knn(auto_train.X,auto_test.X,train.mpg01 ,k=1)
table(auto_knn.pred ,auto_test$mpg01)
##              
## auto_knn.pred  0  1
##             0 36  4
##             1  6 33
mean(auto_knn.pred != auto_test$mpg01)
## [1] 0.1265823
set.seed(1)
auto_knn.pred1 = knn(auto_train.X,auto_test.X,train.mpg01 ,k=3)
table(auto_knn.pred1 ,auto_test$mpg01)
##               
## auto_knn.pred1  0  1
##              0 38  4
##              1  4 33
mean(auto_knn.pred1 != auto_test$mpg01)
## [1] 0.1012658
set.seed(1)
auto_knn.pred2 = knn(auto_train.X,auto_test.X,train.mpg01 ,k=4)
table(auto_knn.pred2 ,auto_test$mpg01)
##               
## auto_knn.pred2  0  1
##              0 39  5
##              1  3 32
mean(auto_knn.pred2 != auto_test$mpg01)
## [1] 0.1012658

The test errors obtained with \(K = 3\) and \(K = 4\) (about 0.101) are the lowest; increasing K beyond 4 did not improve the model any further.
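
KNN is sensitive to the scale of the predictors (weight is in the thousands while cylinders ranges from 3 to 8), so it may also be worth rescanning K on standardized features. This is a sketch only; I have not run it, so whether it improves on the unscaled fits above would need to be checked:

# Standardize using the training means and SDs, then rescan a few K values
train_means <- colMeans(auto_train.X)
train_sds   <- apply(auto_train.X, 2, sd)
train.Xs <- scale(auto_train.X, center = train_means, scale = train_sds)
test.Xs  <- scale(auto_test.X,  center = train_means, scale = train_sds)
set.seed(1)
sapply(c(1, 3, 4, 5, 10), function(k) {
  mean(knn(train.Xs, test.Xs, train.mpg01, k = k) != auto_test$mpg01)
})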

13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

Boston.df <- Boston
str(Boston.df)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston.df$crim <- factor(ifelse(Boston.df$crim > median(Boston.df$crim), 1, 0))
str(Boston.df)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(1)
tr_ind <- sample(nrow(Boston.df), 0.8*nrow(Boston.df), replace = F)

Boston_train <- Boston.df[tr_ind,]
Boston_test <- Boston.df[-tr_ind,]
dim(Boston_train)
## [1] 313  14
dim(Boston_test)
## [1] 193  14

Logistic Regression Model:

Boston_glm <- glm(crim ~ ., data = Boston_train, family = binomial)
summary(Boston_glm)
## 
## Call:
## glm(formula = crim ~ ., family = binomial, data = Boston_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1506  -0.1685  -0.0189   0.1251   3.5854  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.065e+01  8.473e+00  -4.797 1.61e-06 ***
## zn          -6.112e-02  4.170e-02  -1.466  0.14272    
## indus       -1.019e-01  6.072e-02  -1.678  0.09344 .  
## chas         1.043e+00  8.597e-01   1.214  0.22490    
## nox          6.102e+01  1.102e+01   5.535 3.11e-08 ***
## rm          -1.410e+00  9.040e-01  -1.560  0.11885    
## age          3.340e-02  1.617e-02   2.066  0.03887 *  
## dis          7.721e-01  2.868e-01   2.692  0.00710 ** 
## rad          6.945e-01  2.182e-01   3.183  0.00146 ** 
## tax         -3.567e-03  4.523e-03  -0.789  0.43031    
## ptratio      4.649e-01  1.569e-01   2.962  0.00306 ** 
## black       -8.374e-03  5.690e-03  -1.472  0.14108    
## lstat        9.342e-04  6.378e-02   0.015  0.98831    
## medv         2.168e-01  9.138e-02   2.373  0.01764 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 421.14  on 312  degrees of freedom
## Residual deviance: 144.39  on 299  degrees of freedom
## AIC: 172.39
## 
## Number of Fisher Scoring iterations: 9
Boston_glm.prob <- predict(Boston_glm, newdata = Boston_test, type="response")
Boston_glm.pred <- rep("0", 193)
Boston_glm.pred[Boston_glm.prob > .5] = "1"
table(Boston_glm.pred, Boston_test$crim)
##                
## Boston_glm.pred   0   1
##               0  56   6
##               1   9 122
mean(Boston_glm.pred == Boston_test$crim)
## [1] 0.9222798

LDA Model

Boston_lda <- lda(crim ~ ., data = Boston_train)
(Boston_lda)
## Call:
## lda(crim ~ ., data = Boston_train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.600639 0.399361 
## 
## Group means:
##         zn    indus       chas       nox       rm      age      dis      rad
## 0 21.24734  6.55016 0.05319149 0.4651479 6.422298 50.07074 5.163038 4.154255
## 1  2.09600 13.36864 0.16000000 0.6193920 6.227088 85.72720 2.675730 9.904000
##        tax  ptratio    black     lstat     medv
## 0 294.3777 17.87287 388.3455  9.214681 25.32926
## 1 420.2400 18.42400 370.4190 13.777120 23.54480
## 
## Coefficients of linear discriminants:
##                   LD1
## zn      -0.0031146382
## indus   -0.0041399657
## chas     0.0715290350
## nox     10.2734899792
## rm       0.2483535133
## age      0.0140322957
## dis      0.0504023906
## rad      0.0105425620
## tax      0.0007134144
## ptratio  0.0965812895
## black   -0.0046669146
## lstat    0.0137673862
## medv     0.0265661070
Boston_lda.pred <- predict(Boston_lda, newdata = Boston_test)
Boston_lda.class <- Boston_lda.pred$class
table(Boston_lda.class ,Boston_test$crim)
##                 
## Boston_lda.class   0   1
##                0  52  10
##                1  13 118
mean(Boston_lda.class == Boston_test$crim)
## [1] 0.880829

QDA Model

Boston_qda <- qda(crim ~ ., data = Boston_train)
(Boston_qda)
## Call:
## qda(crim ~ ., data = Boston_train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.600639 0.399361 
## 
## Group means:
##         zn    indus       chas       nox       rm      age      dis      rad
## 0 21.24734  6.55016 0.05319149 0.4651479 6.422298 50.07074 5.163038 4.154255
## 1  2.09600 13.36864 0.16000000 0.6193920 6.227088 85.72720 2.675730 9.904000
##        tax  ptratio    black     lstat     medv
## 0 294.3777 17.87287 388.3455  9.214681 25.32926
## 1 420.2400 18.42400 370.4190 13.777120 23.54480
Boston_qda.pred <- predict(Boston_qda, newdata = Boston_test)
Boston_qda.class <- Boston_qda.pred$class
table(Boston_qda.class ,Boston_test$crim)
##                 
## Boston_qda.class   0   1
##                0  61  11
##                1   4 117
mean(Boston_qda.class == Boston_test$crim)
## [1] 0.9222798

KNN Model

Boston_train.X = Boston_train[, -1]   # all predictors, excluding the crim response (column 1)
Boston_test.X = Boston_test[, -1]
train.crim = Boston_train$crim
set.seed(1)
Boston_knn.pred = knn(Boston_train.X, Boston_test.X, train.crim ,k=1)
table(Boston_knn.pred ,Boston_test$crim)
##                
## Boston_knn.pred   0   1
##               0  56   5
##               1   9 123
mean(Boston_knn.pred == Boston_test$crim)
## [1] 0.9274611

I fitted all four models, using all of the predictors each time. KNN with \(K = 1\) gives the highest test accuracy (0.927), just ahead of logistic regression and QDA (both 0.922), with LDA the lowest at 0.881.
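
A natural follow-up, to explore subsets of the predictors as the exercise suggests, is to refit the logistic regression with only the terms that were significant above (nox, age, dis, rad, ptratio, medv). This is a sketch only; its test accuracy is not reported here:

Boston_glm2 <- glm(crim ~ nox + age + dis + rad + ptratio + medv,
                   data = Boston_train, family = binomial)
Boston_glm2.prob <- predict(Boston_glm2, newdata = Boston_test, type = "response")
Boston_glm2.pred <- ifelse(Boston_glm2.prob > .5, "1", "0")
table(Boston_glm2.pred, Boston_test$crim)
mean(Boston_glm2.pred == Boston_test$crim)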