Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.library(ISLR)
library(MASS)
library(class)
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
attach(Weekly)
pairs(Weekly)
cor(`Weekly` [1:8])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
The correlation between the lag variables and today’s are not strong. There is an important correlation between Year and Volume. By plotting the data we can see that Volume increases over time. That means that average numbers of share traded increased from 1990 to 2010.
plot(Weekly$Volume)
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so,which ones?
weekly_glm <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(weekly_glm)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The only predictor that is statistically significant is Lag2 with a \(p-value < alpha(0.05)\). The negative coefficient of this variable suggests that if the market had a positive return two days ago, then it is less likely to go up today.
coef(weekly_glm)
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.26686414 -0.04126894 0.05844168 -0.01606114 -0.02779021 -0.01447206
## Volume
## -0.02274153
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
weekly_probs <- predict(weekly_glm,type="response")
weekly_probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
contrasts(Weekly$Direction)
## Up
## Down 0
## Up 1
weekly_pred <- rep("Down", 1089)
weekly_pred[weekly_probs > .5] = "Up"
table(weekly_pred, Weekly$Direction)
##
## weekly_pred Down Up
## Down 54 48
## Up 430 557
accuracy <- (54+557)/(54+48+430+557)
accuracy
## [1] 0.5610652
mean(weekly_pred == Weekly$Direction)
## [1] 0.5610652
The confusion matrix indicates that the model correctly predicted \(54 + 557 = 611\) in total out of 1089 observations which is 56.11% in accuracy.However this result is misleading because we used the entire data set to make the prediction. We need to split the data set to train the model on one part of the data and use the other part to predict.
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train <- (Year < 2009)
Weekly.2010 <- Weekly[!train,]
Direction.2010 <- Direction[!train]
dim(Weekly.2010)
## [1] 104 9
weekly_train <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
summary(weekly_train)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
weeklyProb <- predict(weekly_train, Weekly.2010, type = "response" )
weeklyPred <- rep("Down", 104)
weeklyPred[weeklyProb > .5] = "Up"
table(weeklyPred, Direction.2010)
## Direction.2010
## weeklyPred Down Up
## Down 9 5
## Up 34 56
mean(weeklyPred == Direction.2010)
## [1] 0.625
mean(weeklyPred != Direction.2010)
## [1] 0.375
(e) Repeat (d) using LDA.
weekly.lda <- lda(Direction ~ Lag2, data = Weekly, subset = train)
weekly.lda
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
plot(weekly.lda)
lda.pred <- predict(weekly.lda, Weekly.2010)
names(lda.pred)
## [1] "class" "posterior" "x"
lda.class=lda.pred$class
table(lda.class ,Direction.2010)
## Direction.2010
## lda.class Down Up
## Down 9 5
## Up 34 56
mean(lda.class == Direction.2010)
## [1] 0.625
sum(lda.pred$posterior[,1]>=.5)
## [1] 14
sum(lda.pred$posterior[,1]<.5)
## [1] 90
lda.pred$posterior[1:20,1]
## 986 987 988 989 990 991 992 993
## 0.4736555 0.3558617 0.5132860 0.5142948 0.4799727 0.4597586 0.3771117 0.5184724
## 994 995 996 997 998 999 1000 1001
## 0.5480397 0.5146118 0.5504246 0.3055404 0.4268160 0.3637275 0.4034316 0.4256310
## 1002 1003 1004 1005
## 0.4277053 0.4548626 0.4308002 0.3674066
lda.class[1:20]
## [1] Up Up Down Down Up Up Up Down Down Down Down Up Up Up Up
## [16] Up Up Up Up Up
## Levels: Down Up
sum(lda.pred$posterior[,1]>.9)
## [1] 0
(f) Repeat (d) using QDA.
weekly.qda <- qda(Direction ~ Lag2, data = Weekly, subset = train)
weekly.qda
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.pred <- predict(weekly.qda, Weekly.2010)
names(qda.pred)
## [1] "class" "posterior"
qda.class=qda.pred$class
table(qda.class ,Direction.2010)
## Direction.2010
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class == Direction.2010)
## [1] 0.5865385
(g) Repeat (d) using KNN with \(K = 1\).
train1 <- Weekly[Weekly$Year <= 2008, ]
test1 <- Weekly[Weekly$Year > 2008, ]
train.X <- data.frame(train1$Lag2)
test.X <- data.frame(test1$Lag2)
train.Direction <- train1$Direction
set.seed(1)
knn.pred <- knn(train.X,test.X,train.Direction , k=1)
table(knn.pred ,Direction.2010)
## Direction.2010
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred == Direction.2010)
## [1] 0.5
(h) Which of these methods appears to provide the best results on this data?
LDA & Logistic Regression get the same test accuracy of 0.625, so these two are better model than QDA (0.587) and KNN (0.5).
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
weekly2 <- Weekly[-c(8)]
weekly_glm1 <- glm(Direction ~ .*., data = weekly2, family = binomial, subset = train)
summary(weekly_glm1)
##
## Call:
## glm(formula = Direction ~ . * ., family = binomial, data = weekly2,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6744 -1.2402 0.9261 1.0812 1.5783
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.934e+01 6.232e+01 0.631 0.5278
## Year -1.958e-02 3.132e-02 -0.625 0.5318
## Lag1 -3.735e+01 2.248e+01 -1.661 0.0966 .
## Lag2 -1.923e+01 2.140e+01 -0.899 0.3689
## Lag3 3.740e+00 2.142e+01 0.175 0.8614
## Lag4 -2.279e+00 2.109e+01 -0.108 0.9140
## Lag5 -9.274e+00 2.141e+01 -0.433 0.6649
## Volume 1.588e+01 7.710e+01 0.206 0.8368
## Year:Lag1 1.870e-02 1.126e-02 1.660 0.0968 .
## Year:Lag2 9.655e-03 1.072e-02 0.901 0.3678
## Year:Lag3 -1.891e-03 1.073e-02 -0.176 0.8602
## Year:Lag4 1.118e-03 1.056e-02 0.106 0.9158
## Year:Lag5 4.611e-03 1.073e-02 0.430 0.6673
## Year:Volume -7.909e-03 3.835e-02 -0.206 0.8366
## Lag1:Lag2 -6.246e-03 1.134e-02 -0.551 0.5818
## Lag1:Lag3 8.394e-03 1.264e-02 0.664 0.5068
## Lag1:Lag4 -8.752e-03 1.062e-02 -0.824 0.4098
## Lag1:Lag5 -1.613e-02 1.233e-02 -1.307 0.1911
## Lag1:Volume -7.181e-02 3.812e-02 -1.884 0.0596 .
## Lag2:Lag3 4.234e-03 1.087e-02 0.389 0.6969
## Lag2:Lag4 6.066e-04 1.237e-02 0.049 0.9609
## Lag2:Lag5 -6.461e-03 1.084e-02 -0.596 0.5512
## Lag2:Volume -1.005e-02 3.401e-02 -0.295 0.7677
## Lag3:Lag4 1.721e-02 1.044e-02 1.649 0.0992 .
## Lag3:Lag5 -1.199e-02 1.212e-02 -0.989 0.3227
## Lag3:Volume 2.267e-02 3.652e-02 0.621 0.5348
## Lag4:Lag5 -3.819e-03 1.090e-02 -0.350 0.7260
## Lag4:Volume 6.133e-03 3.449e-02 0.178 0.8589
## Lag5:Volume 1.381e-02 3.743e-02 0.369 0.7120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1327.2 on 956 degrees of freedom
## AIC: 1385.2
##
## Number of Fisher Scoring iterations: 5
weekly_probs2 <- predict(weekly_glm1,Weekly.2010, type="response")
weekly_Pred2 <- rep("Down", 104)
weekly_Pred2[weekly_probs2 > .5] = "Up"
table(weekly_Pred2, Direction.2010)
## Direction.2010
## weekly_Pred2 Down Up
## Down 23 25
## Up 20 36
mean(weekly_Pred2 == Direction.2010)
## [1] 0.5673077
weekly_glm2 <- step(weekly_glm1,
direction="backward",test="Chisq", trace = F)
summary(weekly_glm2)
##
## Call:
## glm(formula = Direction ~ Year + Lag1 + Lag2 + Lag3 + Lag4 +
## Lag5 + Volume + Year:Lag1 + Lag1:Lag5 + Lag1:Volume + Lag3:Lag4,
## family = binomial, data = weekly2, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6940 -1.2482 0.9681 1.0859 1.7791
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.159567 44.275401 0.071 0.9431
## Year -0.001412 0.022200 -0.064 0.9493
## Lag1 -32.629712 21.173970 -1.541 0.1233
## Lag2 0.047337 0.030839 1.535 0.1248
## Lag3 -0.004278 0.031828 -0.134 0.8931
## Lag4 -0.037207 0.031209 -1.192 0.2332
## Lag5 -0.039435 0.029950 -1.317 0.1879
## Volume -0.092176 0.104161 -0.885 0.3762
## Year:Lag1 0.016334 0.010608 1.540 0.1236
## Lag1:Lag5 -0.016050 0.011114 -1.444 0.1487
## Lag1:Volume -0.060453 0.033451 -1.807 0.0707 .
## Lag3:Lag4 0.014767 0.008953 1.649 0.0991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1334.2 on 973 degrees of freedom
## AIC: 1358.2
##
## Number of Fisher Scoring iterations: 5
weekly_probs3 <- predict(weekly_glm2,Weekly.2010, type="response")
weekly_Pred3 <- rep("Down", 104)
weekly_Pred3[weekly_probs3 > .5] = "Up"
table(weekly_Pred3, Direction.2010)
## Direction.2010
## weekly_Pred3 Down Up
## Down 28 42
## Up 15 19
mean(weekly_Pred3 == Direction.2010)
## [1] 0.4519231
This model did not improved with different combination of predictors and interactions.
weekly.lda1 <- lda(Direction ~ .*., data = weekly2, subset = train)
weekly.lda1
## Call:
## lda(Direction ~ . * ., data = weekly2, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Year Lag1 Lag2 Lag3 Lag4 Lag5
## Down 1999.295 0.289444444 -0.03568254 0.17080045 0.15925624 0.21409297
## Up 1998.853 -0.009213235 0.26036581 0.08404044 0.09220956 0.04548897
## Volume Year:Lag1 Year:Lag2 Year:Lag3 Year:Lag4 Year:Lag5 Year:Volume
## Down 1.266966 578.19599 -72.78371 340.8003 317.5336 426.84087 2538.931
## Up 1.156529 -19.47546 520.28269 167.2162 183.6835 90.64494 2317.141
## Lag1:Lag2 Lag1:Lag3 Lag1:Lag4 Lag1:Lag5 Lag1:Volume Lag2:Lag3
## Down -0.8014495 -0.006554329 -0.4724269 0.3615693 0.3362336 -0.1937158
## Up -0.1393632 0.687742739 -0.2036205 -0.1833118 -0.6187543 -0.6405132
## Lag2:Lag4 Lag2:Lag5 Lag2:Volume Lag3:Lag4 Lag3:Lag5 Lag3:Volume
## Down 0.78824608 -0.3132494 -0.70124208 -0.991496916 0.4913572 -0.1417243
## Up 0.04407141 -0.3497535 0.06257301 0.007162822 0.3137873 -0.2956207
## Lag4:Lag5 Lag4:Volume Lag5:Volume
## Down -0.3686248 -0.2257271 -0.2323645
## Up -0.4945795 -0.1677342 -0.1401411
##
## Coefficients of linear discriminants:
## LD1
## Year -6.050854e-02
## Lag1 -8.225884e+01
## Lag2 -5.381683e+01
## Lag3 1.540054e+01
## Lag4 -6.628949e-01
## Lag5 -3.280721e+01
## Volume 4.457784e+01
## Year:Lag1 4.115458e-02
## Year:Lag2 2.702039e-02
## Year:Lag3 -7.766719e-03
## Year:Lag4 2.585077e-04
## Year:Lag5 1.632926e-02
## Year:Volume -2.219095e-02
## Lag1:Lag2 -2.686924e-02
## Lag1:Lag3 1.823994e-02
## Lag1:Lag4 -2.502826e-02
## Lag1:Lag5 -4.121845e-02
## Lag1:Volume -1.413024e-01
## Lag2:Lag3 1.314428e-02
## Lag2:Lag4 -3.126230e-03
## Lag2:Lag5 -1.322231e-02
## Lag2:Volume -2.089106e-02
## Lag3:Lag4 4.962695e-02
## Lag3:Lag5 -3.042063e-02
## Lag3:Volume 7.665918e-02
## Lag4:Lag5 -1.219883e-02
## Lag4:Volume 3.193697e-02
## Lag5:Volume 3.182095e-02
lda.pred1 <- predict(weekly.lda1, Weekly.2010)
lda.class1=lda.pred1$class
table(lda.class1 ,Direction.2010)
## Direction.2010
## lda.class1 Down Up
## Down 22 23
## Up 21 38
mean(lda.class1 == Direction.2010)
## [1] 0.5769231
This model did not improved with different combination of predictors and interactions.
weekly.qda1 <- qda(Direction ~ .*., data = weekly2, subset = train)
weekly.qda1
## Call:
## qda(Direction ~ . * ., data = weekly2, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Year Lag1 Lag2 Lag3 Lag4 Lag5
## Down 1999.295 0.289444444 -0.03568254 0.17080045 0.15925624 0.21409297
## Up 1998.853 -0.009213235 0.26036581 0.08404044 0.09220956 0.04548897
## Volume Year:Lag1 Year:Lag2 Year:Lag3 Year:Lag4 Year:Lag5 Year:Volume
## Down 1.266966 578.19599 -72.78371 340.8003 317.5336 426.84087 2538.931
## Up 1.156529 -19.47546 520.28269 167.2162 183.6835 90.64494 2317.141
## Lag1:Lag2 Lag1:Lag3 Lag1:Lag4 Lag1:Lag5 Lag1:Volume Lag2:Lag3
## Down -0.8014495 -0.006554329 -0.4724269 0.3615693 0.3362336 -0.1937158
## Up -0.1393632 0.687742739 -0.2036205 -0.1833118 -0.6187543 -0.6405132
## Lag2:Lag4 Lag2:Lag5 Lag2:Volume Lag3:Lag4 Lag3:Lag5 Lag3:Volume
## Down 0.78824608 -0.3132494 -0.70124208 -0.991496916 0.4913572 -0.1417243
## Up 0.04407141 -0.3497535 0.06257301 0.007162822 0.3137873 -0.2956207
## Lag4:Lag5 Lag4:Volume Lag5:Volume
## Down -0.3686248 -0.2257271 -0.2323645
## Up -0.4945795 -0.1677342 -0.1401411
qda.pred1 <- predict(weekly.qda1, Weekly.2010)
qda.class1=qda.pred1$class
table(qda.class1 ,Direction.2010)
## Direction.2010
## qda.class1 Down Up
## Down 28 39
## Up 15 22
mean(qda.class1 == Direction.2010)
## [1] 0.4807692
This model did not improved with different combination of predictors and interactions.
set.seed(1)
knn.pred2 <- knn(train.X,test.X,train.Direction , k=4)
table(knn.pred2 ,Direction.2010)
## Direction.2010
## knn.pred2 Down Up
## Down 20 17
## Up 23 44
mean(knn.pred2 == Direction.2010)
## [1] 0.6153846
set.seed(1)
knn.pred3 <- knn(train.X,test.X,train.Direction , k=5)
table(knn.pred3 ,Direction.2010)
## Direction.2010
## knn.pred3 Down Up
## Down 16 21
## Up 27 40
mean(knn.pred3 == Direction.2010)
## [1] 0.5384615
For the KNN model I used the original train and test from the previous exercise, however, I change the K values. The model improve with \(K = 4\), but started to decreased with \(K > 5\).
Auto data set.(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
auto <- Auto
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
auto$mpg01 <- factor(as.numeric(auto$mpg > median(auto$mpg)))
str(auto)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ mpg01 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
library(ggplot2)
auto$name <- NULL
auto$origin <- factor(auto$origin, labels = c("American", "European", "Japanese"))
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : Factor w/ 3 levels "American","European",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ mpg01 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
pairs(auto)
ggplot(auto, aes(x = mpg01, y = cylinders, col = mpg01)) +
geom_jitter() +
ggtitle("Cylinders vs mpg01 - Scatterplot")
ggplot(auto, aes(x = mpg01, y = cylinders, fill = mpg01)) +
geom_boxplot() +
ggtitle("Cylinders vs mpg01 - Boxplot")
ggplot(auto, aes(x = mpg01, y = displacement, fill = mpg01)) +
geom_boxplot() +
ggtitle("Displacement vs mpg01 - Boxplot")
ggplot(auto, aes(x = mpg01, y = displacement, col = mpg01)) +
geom_jitter() +
ggtitle("Displacement vs mpg01 - Scatterplot")
ggplot(auto, aes(x = mpg01, y = horsepower, fill = mpg01)) +
geom_boxplot() +
ggtitle("Horsepower vs mpg01 - Boxplot")
ggplot(auto, aes(x = mpg01, y = weight, fill = mpg01)) +
geom_boxplot() +
ggtitle("Weight vs mpg01 - Boxplot")
ggplot(auto, aes(x = mpg01, y = acceleration, fill = mpg01)) +
geom_boxplot() +
ggtitle("Acceleration vs mpg01 - Boxplot")
ggplot(auto, aes(x = mpg01, y = year, fill = mpg01)) +
geom_boxplot() +
ggtitle("Year vs mpg01 - Boxplot")
ggplot(auto, aes(x = origin, fill = mpg01)) +
geom_bar() +
ggtitle("Origin vs mpg01 - Bar")
Because mpg1 is binary is hard to see the predicting power of the variable through the scatterplot or boxplot.However cylinders + displacement + horsepower + weight + acceleration could be use to train the model.
(c) Split the data into a training set and a test set.
set.seed(1)
tr_ind <- sample(nrow(auto),0.8*nrow(auto),replace = F)
auto_train <- auto[tr_ind,]
auto_test <- auto[-tr_ind,]
dim(auto_train)
## [1] 313 9
dim(auto_test)
## [1] 79 9
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto_lda <- lda(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration, data = auto_train)
(auto_lda)
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight +
## acceleration, data = auto_train)
##
## Prior probabilities of groups:
## 0 1
## 0.4920128 0.5079872
##
## Group means:
## cylinders displacement horsepower weight acceleration
## 0 6.753247 269.6558 128.0844 3598.026 14.76104
## 1 4.207547 117.5723 79.8239 2350.283 16.49057
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.3683711581
## displacement -0.0034011244
## horsepower 0.0043408694
## weight -0.0009441519
## acceleration 0.0002113197
auto_lda.pred <- predict(auto_lda, newdata = auto_test)
auto_lda.class=auto_lda.pred$class
table(auto_lda.class ,auto_test$mpg01)
##
## auto_lda.class 0 1
## 0 35 0
## 1 7 37
mean(auto_lda.class != auto_test$mpg01)
## [1] 0.08860759
mean(auto_lda.class == auto_test$mpg01)
## [1] 0.9113924
Test error of the model obtained is 0.0886
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto_qda <- qda(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration, data = auto_train)
(auto_qda)
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight +
## acceleration, data = auto_train)
##
## Prior probabilities of groups:
## 0 1
## 0.4920128 0.5079872
##
## Group means:
## cylinders displacement horsepower weight acceleration
## 0 6.753247 269.6558 128.0844 3598.026 14.76104
## 1 4.207547 117.5723 79.8239 2350.283 16.49057
auto_qda.pred <- predict(auto_qda, newdata = auto_test)
auto_qda.class <- auto_qda.pred$class
table(auto_qda.class ,auto_test$mpg01)
##
## auto_qda.class 0 1
## 0 37 2
## 1 5 35
mean(auto_qda.class != auto_test$mpg01)
## [1] 0.08860759
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto_glm <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + acceleration, data = auto_train, family = binomial)
summary(auto_glm)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + acceleration, family = binomial, data = auto_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4243 -0.2259 0.1234 0.4160 3.2483
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.8893784 2.9794937 3.990 6.6e-05 ***
## cylinders 0.0527884 0.3685127 0.143 0.8861
## displacement -0.0114217 0.0087598 -1.304 0.1923
## horsepower -0.0403983 0.0224690 -1.798 0.0722 .
## weight -0.0020999 0.0009563 -2.196 0.0281 *
## acceleration -0.0198422 0.1349911 -0.147 0.8831
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 433.83 on 312 degrees of freedom
## Residual deviance: 175.11 on 307 degrees of freedom
## AIC: 187.11
##
## Number of Fisher Scoring iterations: 7
auto_glm.prob <- predict(auto_glm, newdata = auto_test, type="response")
auto_glm.pred <- rep("0", 79)
auto_glm.pred[auto_glm.prob > .5] = "1"
table(auto_glm.pred, auto_test$mpg01)
##
## auto_glm.pred 0 1
## 0 38 1
## 1 4 36
mean(auto_glm.pred != auto_test$mpg01)
## [1] 0.06329114
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
auto_train.X = data.frame(auto_train$cylinders, auto_train$displacement, auto_train$horsepower, auto_train$weight, auto_train$acceleration)
auto_test.X = data.frame(auto_test$cylinders, auto_test$displacement, auto_test$horsepower, auto_test$weight, auto_test$acceleration)
train.mpg01 = auto_train$mpg01
set.seed(1)
auto_knn.pred = knn(auto_train.X,auto_test.X,train.mpg01 ,k=1)
table(auto_knn.pred ,auto_test$mpg01)
##
## auto_knn.pred 0 1
## 0 36 4
## 1 6 33
mean(auto_knn.pred != auto_test$mpg01)
## [1] 0.1265823
set.seed(1)
auto_knn.pred1 = knn(auto_train.X,auto_test.X,train.mpg01 ,k=3)
table(auto_knn.pred1 ,auto_test$mpg01)
##
## auto_knn.pred1 0 1
## 0 38 4
## 1 4 33
mean(auto_knn.pred1 != auto_test$mpg01)
## [1] 0.1012658
set.seed(1)
auto_knn.pred2 = knn(auto_train.X,auto_test.X,train.mpg01 ,k=4)
table(auto_knn.pred2 ,auto_test$mpg01)
##
## auto_knn.pred2 0 1
## 0 39 5
## 1 3 32
mean(auto_knn.pred2 != auto_test$mpg01)
## [1] 0.1012658
The test error obtain with k = 4 is the best, after K > 4, the model does not improve anymore.
Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.Boston.df <- Boston
str(Boston.df)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston.df$crim <- factor(ifelse(Boston.df$crim > median(Boston.df$crim), 1, 0))
str(Boston.df)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(1)
tr_index <- sample(nrow(Boston.df),0.8*nrow(Boston.df),replace = F)
Boston_train <- Boston.df[tr_ind,]
Boston_test <- Boston.df[-tr_ind,]
dim(Boston_train)
## [1] 313 14
dim(Boston_test)
## [1] 193 14
Boston_glm <- glm(crim ~ ., data = Boston_train, family = binomial)
summary(Boston_glm)
##
## Call:
## glm(formula = crim ~ ., family = binomial, data = Boston_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1506 -0.1685 -0.0189 0.1251 3.5854
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.065e+01 8.473e+00 -4.797 1.61e-06 ***
## zn -6.112e-02 4.170e-02 -1.466 0.14272
## indus -1.019e-01 6.072e-02 -1.678 0.09344 .
## chas 1.043e+00 8.597e-01 1.214 0.22490
## nox 6.102e+01 1.102e+01 5.535 3.11e-08 ***
## rm -1.410e+00 9.040e-01 -1.560 0.11885
## age 3.340e-02 1.617e-02 2.066 0.03887 *
## dis 7.721e-01 2.868e-01 2.692 0.00710 **
## rad 6.945e-01 2.182e-01 3.183 0.00146 **
## tax -3.567e-03 4.523e-03 -0.789 0.43031
## ptratio 4.649e-01 1.569e-01 2.962 0.00306 **
## black -8.374e-03 5.690e-03 -1.472 0.14108
## lstat 9.342e-04 6.378e-02 0.015 0.98831
## medv 2.168e-01 9.138e-02 2.373 0.01764 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 421.14 on 312 degrees of freedom
## Residual deviance: 144.39 on 299 degrees of freedom
## AIC: 172.39
##
## Number of Fisher Scoring iterations: 9
Boston_glm.prob <- predict(Boston_glm, newdata = Boston_test, type="response")
Boston_glm.pred <- rep("0", 193)
Boston_glm.pred[Boston_glm.prob > .5] = "1"
table(Boston_glm.pred, Boston_test$crim)
##
## Boston_glm.pred 0 1
## 0 56 6
## 1 9 122
mean(Boston_glm.pred == Boston_test$crim)
## [1] 0.9222798
Boston_lda <- lda(crim ~ ., data = Boston_train)
(Boston_lda)
## Call:
## lda(crim ~ ., data = Boston_train)
##
## Prior probabilities of groups:
## 0 1
## 0.600639 0.399361
##
## Group means:
## zn indus chas nox rm age dis rad
## 0 21.24734 6.55016 0.05319149 0.4651479 6.422298 50.07074 5.163038 4.154255
## 1 2.09600 13.36864 0.16000000 0.6193920 6.227088 85.72720 2.675730 9.904000
## tax ptratio black lstat medv
## 0 294.3777 17.87287 388.3455 9.214681 25.32926
## 1 420.2400 18.42400 370.4190 13.777120 23.54480
##
## Coefficients of linear discriminants:
## LD1
## zn -0.0031146382
## indus -0.0041399657
## chas 0.0715290350
## nox 10.2734899792
## rm 0.2483535133
## age 0.0140322957
## dis 0.0504023906
## rad 0.0105425620
## tax 0.0007134144
## ptratio 0.0965812895
## black -0.0046669146
## lstat 0.0137673862
## medv 0.0265661070
Boston_lda.pred <- predict(Boston_lda, newdata = Boston_test)
Boston_lda.class <- Boston_lda.pred$class
table(Boston_lda.class ,Boston_test$crim)
##
## Boston_lda.class 0 1
## 0 52 10
## 1 13 118
mean(Boston_lda.class == Boston_test$crim)
## [1] 0.880829
Boston_qda <- qda(crim ~ ., data = Boston_train)
(Boston_qda)
## Call:
## qda(crim ~ ., data = Boston_train)
##
## Prior probabilities of groups:
## 0 1
## 0.600639 0.399361
##
## Group means:
## zn indus chas nox rm age dis rad
## 0 21.24734 6.55016 0.05319149 0.4651479 6.422298 50.07074 5.163038 4.154255
## 1 2.09600 13.36864 0.16000000 0.6193920 6.227088 85.72720 2.675730 9.904000
## tax ptratio black lstat medv
## 0 294.3777 17.87287 388.3455 9.214681 25.32926
## 1 420.2400 18.42400 370.4190 13.777120 23.54480
Boston_qda.pred <- predict(Boston_qda, newdata = Boston_test)
Boston_qda.class <- Boston_qda.pred$class
table(Boston_qda.class ,Boston_test$crim)
##
## Boston_qda.class 0 1
## 0 61 11
## 1 4 117
mean(Boston_qda.class == Boston_test$crim)
## [1] 0.9222798
Boston_train.X = data.frame(Boston_train)
Boston_test.X = data.frame(Boston_test)
train.crim = Boston_train$crim
set.seed(1)
Boston_knn.pred = knn(Boston_train.X, Boston_test.X, train.crim ,k=1)
table(Boston_knn.pred ,Boston_test$crim)
##
## Boston_knn.pred 0 1
## 0 56 5
## 1 9 123
mean(Boston_knn.pred == Boston_test$crim)
## [1] 0.9274611
I fitted all 4 models, using all the predictors each time. The best model was KNN with an accuracy of 0.9274.