10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(ISLR)
?Weekly
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
pairs(Weekly)
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
attach(Weekly)
plot(Volume)
Year and Volume have a strong positive relationship (correlation ≈ 0.84): trading volume grows steadily over time. The lag variables show essentially no correlation with Today or with each other.
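As a quick numerical check (a small sketch, not part of the original output), the correlation between Year and Volume and a year-by-year boxplot make the trend explicit:
cor(Weekly$Year, Weekly$Volume)
boxplot(Volume ~ Year, data = Weekly, xlab = "Year", ylab = "Volume")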
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results.
glm.fits=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly,family=binomial )
summary (glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Do any of the predictors appear to be statistically significant? If so, which ones?
Lag2 is the only predictor that appears statistically significant (p ≈ 0.03); its positive coefficient suggests that a higher return two weeks earlier is associated with a slightly higher probability of the market moving up this week.
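The p-values can also be pulled out of the fitted model directly; a one-line sketch:
coef(summary(glm.fits))[, "Pr(>|z|)"]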
(c) Compute the confusion matrix and overall fraction of correct predictions.
contrasts(Direction)
## Up
## Down 0
## Up 1
glm.probs=predict(glm.fits,type="response")
glm.probs [1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
glm.pred=rep("Down", 1089)
glm.pred[glm.probs > .5] = "Up"
table(glm.pred, Direction)
##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557
(54+557)/1089
## [1] 0.5610652
Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
The model correctly predicted the weekly market direction 56.1% of the time, so the training error rate (which is overly optimistic) is 43.9%. The confusion matrix shows that the mistakes are very one-sided: the model predicted Up in 430 weeks when the market actually went down, but predicted Down in only 48 weeks when the market actually went up. In other words, the model predicts Up for the vast majority of weeks.
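To make this asymmetry explicit, the per-class accuracy can be computed from the counts in the table above (a small sketch using those counts):
557 / (557 + 48)  # fraction of actual Up weeks predicted correctly, about 92%
54 / (54 + 430)   # fraction of actual Down weeks predicted correctly, about 11%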
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor.
Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train=(Year<2009)
Weekly.910= Weekly[!train ,]
Direction.910= Direction[!train]
dim(Weekly.910)
## [1] 104 9
glm.fits=glm(Direction~Lag2, data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 9 5
## Up 34 56
mean(glm.pred==Direction.910)
## [1] 0.625
The new logistic model correctly predicted the weekly trend 62.5% of the time on the held-out 2009-2010 data, an improvement over the 56.1% training accuracy of the full model.
(e) Repeat (d) using LDA.
library(MASS)
lda.fit<-lda(Direction~Lag2, data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 9 5
## Up 34 56
mean(lda.pred$class==Direction.910)
## [1] 0.625
The LDA model also correctly predicted weekly trends 62.5% of the time.
(f) Repeat (d) using QDA.
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 0 0
## Up 43 61
mean(qda.pred==Direction.910)
## [1] 0.5865385
The QDA model has a lower accuracy rate than the previous models, at 58.65%. Notice that it predicts Up for every week in the test set, so its accuracy is just the proportion of test weeks in which the market actually went up.
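A quick way to confirm this (sketch, not part of the original output):
table(qda.pred)              # every one of the 104 test weeks is predicted Up
mean(Direction.910 == "Up")  # proportion of Up weeks in 2009-2010, matching the QDA accuracy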
(g) Repeat (d) using KNN with K = 1.
library(class)
Week.train=as.matrix(Lag2[train])
Week.test=as.matrix(Lag2[!train])
train.Direction =Direction[train]
set.seed(1)
Weekknn.pred=knn(Week.train,Week.test,train.Direction,k=1)
table(Weekknn.pred,Direction.910)
## Direction.910
## Weekknn.pred Down Up
## Down 21 30
## Up 22 31
mean(Weekknn.pred == Direction.910)
## [1] 0.5
Our KNN model has the lowest accuracy rate of all the models at 50%.
(h) Which of these methods appears to provide the best results on this data?
The Logistic Regression and the Linear Discriminant Analysis models both have the best accuracy rates at 62.5%.
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods.
Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
Logistic Regression with Interactions and Transformations
# Logistic regression: Lag2 + Lag1 (note that Lag2:Lag2 simplifies to just Lag2 in R formulas)
glm.fits=glm(Direction~Lag2:Lag2+Lag1, data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 7 8
## Up 36 53
mean(glm.pred==Direction.910)
## [1] 0.5769231
# Logistic regression: Lag2 + Lag1^2 (quadratic transformation of Lag1)
glm.fits=glm(Direction~Lag2:Lag2+I(Lag1^2), data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 8 2
## Up 35 59
mean(glm.pred==Direction.910)
## [1] 0.6442308
# Logistic regression: Lag2 plus the Lag2:Lag3 interaction
glm.fits=glm(Direction~Lag2:Lag3+Lag2, data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 9 5
## Up 34 56
mean(glm.pred==Direction.910)
## [1] 0.625
# Logistic regression: Lag2 + Lag3^2 (quadratic transformation of Lag3)
glm.fits=glm(Direction~Lag2:Lag2+I(Lag3^2), data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 8 5
## Up 35 56
mean(glm.pred==Direction.910)
## [1] 0.6153846
# Logistic regression: Lag2 + Lag4
glm.fits=glm(Direction~Lag2:Lag2+Lag4, data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 8 4
## Up 35 57
mean(glm.pred==Direction.910)
## [1] 0.625
# Logistic regression: Lag2 + Lag4^2 (quadratic transformation of Lag4)
glm.fits=glm(Direction~Lag2:Lag2+I(Lag4^2), data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 8 5
## Up 35 56
mean(glm.pred==Direction.910)
## [1] 0.6153846
# Logistic regression: Lag2 + Lag5
glm.fits=glm(Direction~Lag2:Lag2+Lag5, data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 7 5
## Up 36 56
mean(glm.pred==Direction.910)
## [1] 0.6057692
# Logistic regression: Lag2 + Lag5^2 (quadratic transformation of Lag5)
glm.fits=glm(Direction~Lag2:Lag2+I(Lag5^2), data=Weekly,family=binomial ,subset=train)
glm.probs=predict (glm.fits,Weekly.910, type="response")
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred,Direction.910)
## Direction.910
## glm.pred Down Up
## Down 9 7
## Up 34 54
mean(glm.pred==Direction.910)
## [1] 0.6057692
LDA with Interaction & Transformations
lda.fit<-lda(Direction~Lag2:Lag2+Lag1, data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 7 8
## Up 36 53
mean(lda.pred$class==Direction.910)
## [1] 0.5769231
lda.fit<-lda(Direction~Lag2:Lag2+I(Lag1^2), data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 8 2
## Up 35 59
mean(lda.pred$class==Direction.910)
## [1] 0.6442308
lda.fit<-lda(Direction~Lag2:Lag2+Lag3, data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 8 4
## Up 35 57
mean(lda.pred$class==Direction.910)
## [1] 0.625
lda.fit<-lda(Direction~Lag2:Lag2+I(Lag3^2), data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 9 5
## Up 34 56
mean(lda.pred$class==Direction.910)
## [1] 0.625
lda.fit<-lda(Direction~Lag2:Lag2+Lag4, data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 8 4
## Up 35 57
mean(lda.pred$class==Direction.910)
## [1] 0.625
lda.fit<-lda(Direction~Lag2:Lag2+I(Lag4^2), data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 8 5
## Up 35 56
mean(lda.pred$class==Direction.910)
## [1] 0.6153846
lda.fit<-lda(Direction~Lag2+Lag5, data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 6 5
## Up 37 56
mean(lda.pred$class==Direction.910)
## [1] 0.5961538
lda.fit<-lda(Direction~Lag2:Lag2+I(Lag5^2), data=Weekly, subset=train)
lda.pred<-predict(lda.fit, Weekly.910)
table(lda.pred$class, Direction.910)
## Direction.910
## Down Up
## Down 9 7
## Up 34 54
mean(lda.pred$class==Direction.910)
## [1] 0.6057692
QDA Interactions & Transformations
qda.fit = qda(Direction~Lag2:Lag2+Lag1, data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 7 10
## Up 36 51
mean(qda.pred==Direction.910)
## [1] 0.5576923
qda.fit = qda(Direction~Lag2:Lag2+I(Lag1^2), data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 32 40
## Up 11 21
mean(qda.pred==Direction.910)
## [1] 0.5096154
qda.fit = qda(Direction~Lag2:Lag2+Lag3, data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 4 2
## Up 39 59
mean(qda.pred==Direction.910)
## [1] 0.6057692
qda.fit = qda(Direction~Lag2:Lag2+I(Lag3^2), data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 33 47
## Up 10 14
mean(qda.pred==Direction.910)
## [1] 0.4519231
qda.fit = qda(Direction~Lag2:Lag2+Lag4, data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 9 14
## Up 34 47
mean(qda.pred==Direction.910)
## [1] 0.5384615
qda.fit = qda(Direction~Lag2:Lag2+I(Lag4^2), data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 7 9
## Up 36 52
mean(qda.pred==Direction.910)
## [1] 0.5673077
qda.fit = qda(Direction~Lag2:Lag2+Lag5, data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 3 11
## Up 40 50
mean(qda.pred==Direction.910)
## [1] 0.5096154
qda.fit = qda(Direction~Lag2:Lag2+I(Lag5^2), data = Weekly, subset = train)
qda.pred = predict(qda.fit, Weekly.910)$class
table(qda.pred, Direction.910)
## Direction.910
## qda.pred Down Up
## Down 2 10
## Up 41 51
mean(qda.pred==Direction.910)
## [1] 0.5096154
KNN
#K=10
Week.train=as.matrix(Lag2[train])
Week.test=as.matrix(Lag2[!train])
train.Direction =Direction[train]
set.seed(1)
Weekknn.pred=knn(Week.train,Week.test,train.Direction,k=10)
table(Weekknn.pred,Direction.910)
## Direction.910
## Weekknn.pred Down Up
## Down 17 21
## Up 26 40
mean(Weekknn.pred == Direction.910)
## [1] 0.5480769
#K=50
Week.train=as.matrix(Lag2[train])
Week.test=as.matrix(Lag2[!train])
train.Direction =Direction[train]
set.seed(1)
Weekknn.pred=knn(Week.train,Week.test,train.Direction,k=50)
table(Weekknn.pred,Direction.910)
## Direction.910
## Weekknn.pred Down Up
## Down 20 23
## Up 23 38
mean(Weekknn.pred == Direction.910)
## [1] 0.5576923
#K=100
Week.train=as.matrix(Lag2[train])
Week.test=as.matrix(Lag2[!train])
train.Direction =Direction[train]
set.seed(1)
Weekknn.pred=knn(Week.train,Week.test,train.Direction,k=100)
table(Weekknn.pred,Direction.910)
## Direction.910
## Weekknn.pred Down Up
## Down 10 11
## Up 33 50
mean(Weekknn.pred == Direction.910)
## [1] 0.5769231
Note that in R’s formula syntax Lag2:Lag2 simplifies to just Lag2, so most of these models are additive: Lag2 plus one other (possibly transformed) lag. For logistic regression, Lag2 + Lag1 achieves 57.7% accuracy on the held-out data, while Lag2 + Lag1^2 achieves 64.4%, an increase of about 6.7 percentage points and the highest accuracy obtained so far on the held-out data. Adding the Lag2:Lag3 interaction, or adding Lag4, leaves the accuracy at 62.5%, the same as the model with Lag2 alone.
The LDA fits give results identical to the corresponding logistic regressions for the first two specifications (57.7% and 64.4%). Replacing Lag4 with its square lowers the accuracy slightly (62.5% to 61.5%), while replacing Lag5 with its square raises it from 59.6% to 60.6%.
For QDA, the best result is Lag2 + Lag3 at 60.6%; most of the quadratic transformations lower the accuracy. The exceptions are Lag4, where squaring raises the accuracy from 53.8% to 56.7%, and Lag5, where squaring leaves it unchanged at 51.0%.
For the KNN models, accuracy increases with K on this data: at K = 100 the accuracy is 57.69%, only about 2.9 percentage points above the 54.8% at K = 10 and still below the other methods. Overall, the best models on the held-out data are the logistic regression and LDA fits using Lag2 plus the quadratic term of Lag1.
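As a sanity check on the formula simplification mentioned above, the two specifications can be fit side by side; if Lag2:Lag2 really reduces to Lag2, the two coefficient vectors should be identical (a sketch using the same training indicator as before):
fit.a = glm(Direction ~ Lag2:Lag2 + Lag1, data = Weekly, family = binomial, subset = train)
fit.b = glm(Direction ~ Lag2 + Lag1, data = Weekly, family = binomial, subset = train)
all.equal(coef(fit.a), coef(fit.b))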
11. This problem uses the Auto data set. (a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.
You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
attach(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] <- 1
Auto = data.frame(Auto, mpg01)
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features.
Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
pairs(Auto)
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
mpg01 has a strong negative correlation with cylinders, displacement, horsepower, and weight, and a moderate positive correlation with origin and year. The four strongly correlated variables look the most useful for predicting mpg01.
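Boxplots of the strongly correlated variables against mpg01 tell the same story; a sketch (any of the candidate predictors could be substituted):
par(mfrow = c(2, 2))
boxplot(cylinders ~ mpg01, data = Auto, main = "cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "weight vs mpg01")
par(mfrow = c(1, 1))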
(c) Split the data into a training set and a test set.
train <- (year %% 2 == 0)   # training set: even model years
train.auto <- Auto[train, ]
test.auto <- Auto[!train, ]  # test set: odd model years
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
autolda.fit <- lda(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = train.auto)
autolda.pred <- predict(autolda.fit, test.auto)
table(autolda.pred$class, test.auto$mpg01)
##
## 0 1
## 0 169 7
## 1 26 189
mean(autolda.pred$class != test.auto$mpg01)
## [1] 0.08439898
The test error rate for the Auto LDA model is 8.44%.
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
autoqda.fit <- qda(mpg01~displacement+horsepower+weight+year+cylinders+origin, data=train.auto)
autoqda.pred <- predict(autoqda.fit, test.auto)
table(autoqda.pred$class, test.auto$mpg01)
##
## 0 1
## 0 176 20
## 1 19 176
mean(autoqda.pred$class != test.auto$mpg01)
## [1] 0.09974425
Our test error rate for our QDA model is 9.97%.
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
auto.fit<-glm(mpg01~displacement+horsepower+weight+year+cylinders+origin, data=train.auto,family=binomial)
auto.probs = predict(auto.fit, test.auto, type = "response")
auto.pred = rep(0, length(auto.probs))
auto.pred[auto.probs > 0.5] = 1
table(auto.pred, test.auto$mpg01)
##
## auto.pred 0 1
## 0 174 12
## 1 21 184
mean(auto.pred != test.auto$mpg01)
## [1] 0.08439898
For the logistic model we have a test error rate of 8.44%.
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
#K=1
train.K = cbind(displacement, horsepower, weight, cylinders, year, origin)[train, ]
test.K = cbind(displacement, horsepower, weight, cylinders, year, origin)[!train, ]
set.seed(1)
autok.pred=knn(train.K,test.K,train.auto$mpg01,k=1)
mean(autok.pred != test.auto$mpg01)
## [1] 0.07161125
#K=10
autok.pred=knn(train.K,test.K,train.auto$mpg01,k=10)
mean(autok.pred != test.auto$mpg01)
## [1] 0.1253197
#K=15
autok.pred=knn(train.K,test.K,train.auto$mpg01,k=15)
mean(autok.pred != test.auto$mpg01)
## [1] 0.1355499
The lowest test error rate, 7.16%, occurs at k = 1; the error rate rises as k increases (12.5% at k = 10 and 13.6% at k = 15), so k = 1 performs best on this data set.
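Several values of k can also be scanned in one pass (a sketch reusing the train.K and test.K matrices built above):
set.seed(1)
ks = c(1, 3, 5, 10, 15, 20)
knn.err = sapply(ks, function(k) {
  pred = knn(train.K, test.K, train.auto$mpg01, k = k)
  mean(pred != test.auto$mpg01)
})
names(knn.err) = ks
knn.err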
13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
attach(Boston)
#creating a binary crim variable
crime01 = rep(0, length(crim))
crime01[crim > median(crim)]= 1
Boston= data.frame(Boston,crime01)
#splitting the dataset
train = 1:(dim(Boston)[1]/2)
test = (dim(Boston)[1]/2 + 1):dim(Boston)[1]
Boston.train = Boston[train, ]
Boston.test = Boston[test, ]
crime01.test = crime01[test]
pairs(Boston)
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crime01 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## crime01 -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128 0.2535684
## black lstat medv crime01
## crim -0.38506394 0.4556215 -0.3883046 0.40939545
## zn 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus -0.35697654 0.6037997 -0.4837252 0.60326017
## chas 0.04878848 -0.0539293 0.1752602 0.07009677
## nox -0.38005064 0.5908789 -0.4273208 0.72323480
## rm 0.12806864 -0.6138083 0.6953599 -0.15637178
## age -0.27353398 0.6023385 -0.3769546 0.61393992
## dis 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad -0.44441282 0.4886763 -0.3816262 0.61978625
## tax -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio -0.17738330 0.3740443 -0.5077867 0.25356836
## black 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat -0.36608690 1.0000000 -0.7376627 0.45326273
## medv 0.33346082 -0.7376627 1.0000000 -0.26301673
## crime01 -0.35121093 0.4532627 -0.2630167 1.00000000
Here we can see that zn has a moderate negative relationship with crime01 and lstat a moderate positive one. indus, nox, age, rad, and tax all have strong positive relationships with crime01, while dis has a strong negative relationship.
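Ranking the predictors by the absolute value of their correlation with crime01 summarizes this in one line (sketch):
sort(abs(cor(Boston)[, "crime01"]), decreasing = TRUE)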
Logistic Regression
set.seed(1)
Boston.fit <-glm(crime01~ indus+nox+age+dis+rad+tax+lstat+zn, data=Boston.train,family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Boston.probs = predict(Boston.fit, Boston.test, type = "response")
Boston.pred = rep(0, length(Boston.probs))
Boston.pred[Boston.probs > 0.5] = 1
table(Boston.pred, crime01.test)
## crime01.test
## Boston.pred 0 1
## 0 71 22
## 1 19 141
mean(Boston.pred != crime01.test)
## [1] 0.1620553
summary(Boston.fit)
##
## Call:
## glm(formula = crime01 ~ indus + nox + age + dis + rad + tax +
## lstat + zn, family = binomial, data = Boston.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0916 -0.2266 -0.0002 0.2994 3.8441
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.348e+01 8.764e+00 -6.102 1.05e-09 ***
## indus 7.290e-02 1.013e-01 0.719 0.4719
## nox 7.242e+01 1.479e+01 4.897 9.73e-07 ***
## age -1.562e-04 1.508e-02 -0.010 0.9917
## dis 2.460e+00 5.609e-01 4.385 1.16e-05 ***
## rad 1.561e+00 3.380e-01 4.619 3.85e-06 ***
## tax -8.719e-03 5.233e-03 -1.666 0.0957 .
## lstat 9.325e-02 4.884e-02 1.909 0.0562 .
## zn -6.483e-01 1.430e-01 -4.532 5.84e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 329.37 on 252 degrees of freedom
## Residual deviance: 107.62 on 244 degrees of freedom
## AIC: 125.62
##
## Number of Fisher Scoring iterations: 9
The test error rate for this model is 16.2%.
set.seed(1)
Boston.fit <-glm(crime01~ indus+nox+age+dis+rad+tax, data=Boston.train,family=binomial)
Boston.probs = predict(Boston.fit, Boston.test, type = "response")
Boston.pred = rep(0, length(Boston.probs))
Boston.pred[Boston.probs > 0.5] = 1
table(Boston.pred, crime01.test)
## crime01.test
## Boston.pred 0 1
## 0 75 8
## 1 15 155
mean(Boston.pred != crime01.test)
## [1] 0.09090909
summary(Boston.fit)
##
## Call:
## glm(formula = crime01 ~ indus + nox + age + dis + rad + tax,
## family = binomial, data = Boston.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.97810 -0.21406 -0.03454 0.47107 3.04502
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -42.214032 7.617440 -5.542 2.99e-08 ***
## indus -0.213126 0.073236 -2.910 0.00361 **
## nox 80.868029 16.066473 5.033 4.82e-07 ***
## age 0.003397 0.012032 0.282 0.77772
## dis 0.307145 0.190502 1.612 0.10690
## rad 0.847236 0.183767 4.610 4.02e-06 ***
## tax -0.013760 0.004956 -2.777 0.00549 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 329.37 on 252 degrees of freedom
## Residual deviance: 144.44 on 246 degrees of freedom
## AIC: 158.44
##
## Number of Fisher Scoring iterations: 8
Using only the variables with a strong relationship to crime01 (indus, nox, age, dis, rad, tax) reduced the test error rate to 9.09%. These are the variables used in the remaining classification models.
LDA
Boston.ldafit <- lda(crime01 ~ indus + nox + age + dis + rad + tax, data = Boston.train)
Bostonlda.pred = predict(Boston.ldafit, Boston.test)
table(Bostonlda.pred$class, crime01.test)
## crime01.test
## 0 1
## 0 81 18
## 1 9 145
mean(Bostonlda.pred$class != crime01.test)
## [1] 0.1067194
KNN
#K=1
train.K=cbind(indus,nox,age,dis,rad,tax)[train,]
test.K=cbind(indus,nox,age,dis,rad,tax)[test,]
Bosknn.pred = knn(train.K, test.K, crime01[train], k = 1)  # third argument: training-set labels
table(Bosknn.pred,crime01.test)
## crime01.test
## Bosknn.pred 0 1
## 0 31 155
## 1 59 8
mean(Bosknn.pred !=crime01.test)
## [1] 0.8458498
#K=100
train.K=cbind(indus,nox,age,dis,rad,tax)[train,]
test.K=cbind(indus,nox,age,dis,rad,tax)[test,]
Bosknn.pred = knn(train.K, test.K, crime01[train], k = 100)
table(Bosknn.pred,crime01.test)
## crime01.test
## Bosknn.pred 0 1
## 0 21 6
## 1 69 157
mean(Bosknn.pred !=crime01.test)
## [1] 0.2964427
The results show that the best classification model here is logistic regression, which had the lowest test error rate at 9.09%. The LDA model gives similar predictions but a slightly higher error rate of 10.67%. The KNN models perform worst: 84.58% error at k = 1 versus 29.64% at k = 100, so increasing k improves the error rate considerably, but KNN still trails the other two methods on this split.
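The effect of k could also be checked more systematically by scanning several values in one pass (a sketch reusing the train.K and test.K matrices and the training labels above):
set.seed(1)
ks = c(1, 5, 10, 25, 50, 100)
sapply(ks, function(k) mean(knn(train.K, test.K, crime01[train], k = k) != crime01.test))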