This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
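The package loading is not shown in the output below; the analysis assumes roughly the following preamble (Weekly and Auto come from ISLR, while Boston, lda(), and qda() come from MASS, and knn() from class):

library(ISLR)   # Weekly and Auto data sets
library(MASS)   # Boston data set, lda(), qda()
library(class)  # knn()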
attach(Weekly)
pairs(Weekly[,-9])
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
The scatter plot matrix and the correlation matrix show no notable relationships among the lag variables. The only strong pattern is between Year and Volume: their correlation coefficient is 0.84, and the scatter plot shows Volume increasing over time.
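A plot of Volume against Year (not included in the original output) makes this trend visible directly:

plot(Year, Volume)  # trading volume grows over the years, consistent with the 0.84 correlation with Year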
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
weekly_glm = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = "binomial")
summary(weekly_glm)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only statistically significant predictor: its p-value of 0.0296 is below the 0.05 significance level.
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
probs_glm=predict(weekly_glm,type="response")
pred_glm=rep("Down",length(probs_glm))
pred_glm[probs_glm>0.5]="Up"
table(pred_glm,Direction)
## Direction
## pred_glm Down Up
## Down 54 48
## Up 430 557
The overall fraction of correct predictions is \(\frac{54+557}{54+48+430+557}=0.5611\), so the model predicts the weekly market direction correctly 56.11% of the time. Breaking this down by class, the model correctly identifies Up weeks at a rate of \(\frac{557}{48+557}=0.9207\) (92.07%), but Down weeks at a rate of only \(\frac{54}{54+430}=0.1115\) (11.15%). In other words, most of the model's mistakes come from predicting Up when the market actually goes Down.
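These fractions can also be computed directly from the confusion matrix; a short sketch (the name conf_mat is introduced here for convenience):

conf_mat = table(pred_glm, Direction)
sum(diag(conf_mat)) / sum(conf_mat)                 # overall accuracy: (54 + 557) / 1089
conf_mat["Up", "Up"] / sum(conf_mat[, "Up"])        # correct rate on weeks that were actually Up
conf_mat["Down", "Down"] / sum(conf_mat[, "Down"])  # correct rate on weeks that were actually Down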
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train=(Year>=1990 & Year<=2008)
Weekly.train = Weekly[train,]
Weekly.test = Weekly[!train,]
fits_glm = glm(Direction ~ Lag2, data = Weekly.train, family = "binomial")
summary(fits_glm)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
We split the data into training and test sets: the training set contains the 985 observations with Year from 1990 to 2008, and the test set contains the 104 observations from 2009 and 2010. We then fit the new model fits_glm on the training set only.
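As a quick sanity check on the split sizes (not part of the original output):

dim(Weekly.train)  # expect 985 rows, the weeks from 1990-2008
dim(Weekly.test)   # expect 104 rows, the weeks from 2009-2010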
Direction.test=Direction[!train]
probs_glm=predict(fits_glm,Weekly.test,type="response") # predicted probabilities on the held-out 2009-2010 data
pred_glm=rep("Down",length(probs_glm)) # start with "Down" for all 104 test observations
pred_glm[probs_glm>0.5]="Up" # switch to "Up" where the predicted probability exceeds 0.5
table(pred_glm,Direction.test)
## Direction.test
## pred_glm Down Up
## Down 9 5
## Up 34 56
We apply the newly fitted model fits_glm to the test set and compute the confusion matrix. The table shows 9 observations correctly classified as “Down” and 56 correctly classified as “Up”.
mean(pred_glm==Direction.test)
## [1] 0.625
The model predicts the weekly trend correctly 62.5% of the time on the held-out data, a moderate improvement over the model fit to the full data set. It again does much better on upward trends (56/61 = 91.80%) than on downward trends (9/43 = 20.93%), though the Down rate is a clear improvement over the 11.15% seen earlier.
(e) Repeat (d) using LDA.
fits_lda = lda(Direction ~ Lag2, data = Weekly.train)
fits_lda
## Call:
## lda(Direction ~ Lag2, data = Weekly.train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
pred_lda=predict(fits_lda, Weekly.test)
lda.class=pred_lda$class
table(lda.class,Direction.test)
## Direction.test
## lda.class Down Up
## Down 9 5
## Up 34 56
mean(lda.class==Direction.test)
## [1] 0.625
The confusion matrix for the model fit using LDA produced the same result as that of logistic regression.
(f) Repeat (d) using QDA.
fit_qda=qda(Direction~Lag2,data=Weekly.train)
fit_qda
## Call:
## qda(Direction ~ Lag2, data = Weekly.train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.class=predict(fit_qda,Weekly.test)$class
table(qda.class,Direction.test)
## Direction.test
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class==Direction.test)
## [1] 0.5865385
The accuracy of QDA is 58.65%, lower than that of LDA and logistic regression. The confusion matrix shows that QDA predicts “Up” for every test observation.
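Since QDA predicts “Up” for every test week, its accuracy is simply the proportion of Up weeks in the test period, which can be checked directly (not part of the original output):

mean(Direction.test == "Up")  # 61 / 104 = 0.5865, matching the QDA accuracy above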
(g) Repeat (d) using KNN with K = 1.
train.X <- cbind(Weekly[train,3])   # Lag2 (column 3) for the 1990-2008 training weeks
test.X <- cbind(Weekly[!train,3])   # Lag2 for the 2009-2010 test weeks
train.Direction <- Weekly[train,c(9)]   # Direction labels (column 9) for the training weeks
test.Direction <- Weekly[!train,c(9)]   # Direction labels for the test weeks
knn.pred=knn(train.X,test.X,train.Direction,k=1)
table(knn.pred,test.Direction)
## test.Direction
## knn.pred Down Up
## Down 21 29
## Up 22 32
mean(knn.pred==test.Direction)
## [1] 0.5096154
KNN with K = 1 gives an accuracy of 50.96%, essentially no better than random guessing.
(h) Which of these methods appears to provide the best results on this data?
Using accuracy as the metric, logistic regression and LDA both achieve 62.5%, QDA achieves 58.65%, and KNN (K = 1) achieves 50.96%. Logistic regression and LDA therefore perform best on this data.
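For reference, the four test accuracies reported above can be collected and sorted in one step (values copied from the outputs above):

accuracy = c(logistic = 0.6250, lda = 0.6250, qda = 0.5865, knn_k1 = 0.5096)
sort(accuracy, decreasing = TRUE)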
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
Logistic regression with square transformation of predictors
glm.sqfits = glm(Direction ~ Lag2+I(Lag2^2), data = Weekly.train, family = "binomial")
summary(glm.sqfits)
##
## Call:
## glm(formula = Direction ~ Lag2 + I(Lag2^2), family = "binomial",
## data = Weekly.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.791 -1.253 1.005 1.100 1.196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.179006 0.068357 2.619 0.00883 **
## Lag2 0.064920 0.029734 2.183 0.02901 *
## I(Lag2^2) 0.004713 0.004569 1.031 0.30236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1349.4 on 982 degrees of freedom
## AIC: 1355.4
##
## Number of Fisher Scoring iterations: 4
glm.sqprobs = predict(glm.sqfits, Weekly.test, type = "response")
glm.pred = rep("Down", nrow(Weekly.test))
glm.pred[glm.sqprobs > 0.5] = "Up"
table(glm.pred, Direction.test)
## Direction.test
## glm.pred Down Up
## Down 8 4
## Up 35 57
mean(glm.pred==Direction.test)
## [1] 0.625
Logistic regression with interaction between predictors
glm.intfits = glm(Direction ~ Lag1+Lag2+Lag1*Lag2, data = Weekly.train, family = "binomial")
summary(glm.intfits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag1 * Lag2, family = "binomial",
## data = Weekly.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.573 -1.259 1.003 1.086 1.596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.211419 0.064589 3.273 0.00106 **
## Lag1 -0.051505 0.030727 -1.676 0.09370 .
## Lag2 0.053471 0.029193 1.832 0.06700 .
## Lag1:Lag2 0.001921 0.007460 0.257 0.79680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1346.9 on 981 degrees of freedom
## AIC: 1354.9
##
## Number of Fisher Scoring iterations: 4
glm.intprobs = predict(glm.intfits, Weekly.test, type = "response")
glm.pred = rep("Down", nrow(Weekly.test))
glm.pred[glm.intprobs > 0.5] = "Up"
table(glm.pred, Direction.test)
## Direction.test
## glm.pred Down Up
## Down 7 8
## Up 36 53
mean(glm.pred==Direction.test)
## [1] 0.5769231
Logistic regression with log transformation of predictors
glm.logfits = glm(Direction ~ log10(abs(Lag2)), data = Weekly.train, family = "binomial")
summary(glm.logfits)
##
## Call:
## glm(formula = Direction ~ log10(abs(Lag2)), family = "binomial",
## data = Weekly.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.305 -1.269 1.074 1.088 1.167
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20916 0.06410 3.263 0.0011 **
## log10(abs(Lag2)) 0.06823 0.13149 0.519 0.6038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1354.4 on 983 degrees of freedom
## AIC: 1358.4
##
## Number of Fisher Scoring iterations: 3
glm.logprobs = predict(glm.logfits, Weekly.test, type = "response")
glm.pred = rep("Down", nrow(Weekly.test))
glm.pred[glm.logprobs > 0.5] = "Up"
table(glm.pred, Direction.test)
## Direction.test
## glm.pred Down Up
## Up 43 61
mean(glm.pred==Direction.test)
## [1] 0.5865385
Logistic regression with sqrt transformation of predictors
glm.sqrtfits = glm(Direction ~ sqrt(abs(Lag2)), data = Weekly.train, family = "binomial")
summary(glm.sqrtfits)
##
## Call:
## glm(formula = Direction ~ sqrt(abs(Lag2)), family = "binomial",
## data = Weekly.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.405 -1.263 1.058 1.093 1.136
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.09488 0.15028 0.631 0.528
## sqrt(abs(Lag2)) 0.09961 0.11788 0.845 0.398
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1354.0 on 983 degrees of freedom
## AIC: 1358
##
## Number of Fisher Scoring iterations: 3
glm.sqrtprobs = predict(glm.sqrtfits, Weekly.test, type = "response")
glm.pred = rep("Down", nrow(Weekly.test))
glm.pred[glm.sqrtprobs > 0.5] = "Up"
table(glm.pred, Direction.test)
## Direction.test
## glm.pred Down Up
## Up 43 61
mean(glm.pred==Direction.test)
## [1] 0.5865385
LDA with log transformation of predictors
lda.fits = lda(Direction ~ log10(abs(Lag2)), data = Weekly.train)
summary(lda.fits)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 2 -none- numeric
## scaling 1 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list
lda.pred=predict(lda.fits, Weekly.test)
lda.class=lda.pred$class
table(lda.class,Direction.test)
## Direction.test
## lda.class Down Up
## Down 0 0
## Up 43 61
mean(lda.class==Direction.test)
## [1] 0.5865385
QDA with log transformation of predictors
qda.fit=qda(Direction~log10(abs(Lag2)),data=Weekly.train)
qda.fit
## Call:
## qda(Direction ~ log10(abs(Lag2)), data = Weekly.train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## log10(abs(Lag2))
## Down 0.002797989
## Up 0.018983896
qda.class=predict(qda.fit,Weekly.test)$class
table(qda.class,Direction.test)
## Direction.test
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class==Direction.test)
## [1] 0.5865385
KNN with k=2
knn.pred=knn(train.X,test.X,train.Direction,k=2)
table(knn.pred,test.Direction)
## test.Direction
## knn.pred Down Up
## Down 20 26
## Up 23 35
mean(knn.pred==test.Direction)
## [1] 0.5288462
KNN with k=3
knn.pred=knn(train.X,test.X,train.Direction,k=3)
table(knn.pred,test.Direction)
## test.Direction
## knn.pred Down Up
## Down 16 19
## Up 27 42
mean(knn.pred==test.Direction)
## [1] 0.5576923
Summarizing the test accuracies of the combinations tried above: logistic regression with Lag2 + Lag2^2 achieves 62.5%, the Lag1*Lag2 interaction model 57.69%, the log10(abs(Lag2)), sqrt(abs(Lag2)), LDA, and QDA variants 58.65% each, KNN with k = 2 52.88%, and KNN with k = 3 55.77%.
Overall, logistic regression with the squared Lag2 term performs best among these experiments, matching the 62.5% of the simple Lag2-only model. Among the KNN fits (k = 1, 2, 3), k = 3 performs best.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
detach(Weekly)
attach(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
Auto$mpg01 <- ifelse(mpg >= median(mpg), 1, 0)  # 1 if mpg is at or above its median, 0 otherwise
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
pairs(Auto[,-9])
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
From the scatter plot matrix and the correlation matrix above, cylinders, weight, displacement, and horsepower are the variables most strongly (negatively) correlated with mpg01, so they seem the most useful for predicting it.
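The exercise also suggests boxplots; a minimal sketch of how they could be produced for the four candidate predictors (plots not reproduced here):

par(mfrow = c(2, 2))
boxplot(cylinders ~ mpg01, data = Auto, xlab = "mpg01", ylab = "cylinders")
boxplot(weight ~ mpg01, data = Auto, xlab = "mpg01", ylab = "weight")
boxplot(displacement ~ mpg01, data = Auto, xlab = "mpg01", ylab = "displacement")
boxplot(horsepower ~ mpg01, data = Auto, xlab = "mpg01", ylab = "horsepower")
par(mfrow = c(1, 1))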
(c) Split the data into a training set and a test set.
set.seed(123)
index = sample(nrow(Auto), 0.8*nrow(Auto), replace = F)  # 80% of the rows for training, 20% held out for testing
train = Auto[index,]
test = Auto[-index,]
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
fit_lda=lda(mpg01~cylinders+weight+displacement+horsepower,data=train)
fit_lda
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4920128 0.5079872
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.798701 3634.422 274.987 131.45455
## 1 4.163522 2323.302 114.283 78.45912
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.501316051
## weight -0.001000777
## displacement -0.001033481
## horsepower 0.003881148
lda.pred=predict(fit_lda, test)
lda.class=lda.pred$class
table(lda.class,test$mpg01)
##
## lda.class 0 1
## 0 35 4
## 1 7 33
mean(lda.class!=test$mpg01)
## [1] 0.1392405
The test error (misclassification rate) of the LDA model is 13.92%.
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit=qda(mpg01~cylinders+weight+displacement+horsepower,data=train)
qda.fit
## Call:
## qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4920128 0.5079872
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.798701 3634.422 274.987 131.45455
## 1 4.163522 2323.302 114.283 78.45912
qda.pred=predict(qda.fit, test)
qda.class=qda.pred$class
table(qda.class,test$mpg01)
##
## qda.class 0 1
## 0 37 4
## 1 5 33
mean(qda.class!=test$mpg01)
## [1] 0.1139241
The test error (misclassification rate) of the QDA model is 11.39%.
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
glm.fit=glm(mpg01~cylinders+weight+displacement+horsepower,data=train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.53768 -0.12766 0.07953 0.29411 3.14150
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.2775482 2.1223034 6.256 3.94e-10 ***
## cylinders 0.0868500 0.3902663 0.223 0.82389
## weight -0.0021961 0.0008331 -2.636 0.00838 **
## displacement -0.0151889 0.0093531 -1.624 0.10439
## horsepower -0.0514584 0.0168285 -3.058 0.00223 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 433.83 on 312 degrees of freedom
## Residual deviance: 149.26 on 308 degrees of freedom
## AIC: 159.26
##
## Number of Fisher Scoring iterations: 7
glm.prob=predict(glm.fit, test, type = "response")
glm.pred = rep(0, nrow(test))
glm.pred[glm.prob > 0.5] = 1
table(glm.pred,test$mpg01)
##
## glm.pred 0 1
## 0 35 5
## 1 7 32
mean(glm.pred!=test$mpg01)
## [1] 0.1518987
The test error (misclassification rate) of the logistic regression model is 15.19%.
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
With k = 1
train.X = cbind(cylinders,weight,displacement,horsepower)[index,]
test.X <- cbind(cylinders,weight,displacement,horsepower)[-index,]
train.mpg01 <- Auto[index,c(10)]  # column 10 is the mpg01 indicator created in (a)
test.mpg01 <- Auto[-index,c(10)]
knn.pred=knn(train.X,test.X,train.mpg01,k=1)
table(knn.pred,test.mpg01)
## test.mpg01
## knn.pred 0 1
## 0 35 9
## 1 7 28
mean(knn.pred!=test.mpg01)
## [1] 0.2025316
With k = 2
knn.pred=knn(train.X,test.X,train.mpg01,k=2)
table(knn.pred,test.mpg01)
## test.mpg01
## knn.pred 0 1
## 0 34 7
## 1 8 30
mean(knn.pred!=test.mpg01)
## [1] 0.1898734
With k=10
knn.pred=knn(train.X,test.X,train.mpg01,k=10)
table(knn.pred,test.mpg01)
## test.mpg01
## knn.pred 0 1
## 0 35 5
## 1 7 32
mean(knn.pred!=test.mpg01)
## [1] 0.1518987
With k = 20
knn.pred=knn(train.X,test.X,train.mpg01,k=20)
table(knn.pred,test.mpg01)
## test.mpg01
## knn.pred 0 1
## 0 35 3
## 1 7 34
mean(knn.pred!=test.mpg01)
## [1] 0.1265823
The test errors are 20.25% for k = 1, 18.99% for k = 2, 15.19% for k = 10, and 12.66% for k = 20. Of the values tried, k = 20 gives the lowest test error.
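To search over K more systematically, one could loop over several values and record the test error; a minimal sketch (set.seed only makes the random tie-breaking inside knn() reproducible):

set.seed(1)
k_values = c(1, 2, 5, 10, 15, 20, 30, 50)
test_error = sapply(k_values, function(k) {
  mean(knn(train.X, test.X, train.mpg01, k = k) != test.mpg01)
})
data.frame(k = k_values, test_error)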
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
detach(Auto)
attach(Boston)
Boston$crim01 = ifelse(crim >= median(crim), 1, 0)
set.seed(123)
index = sample(nrow(Boston), 0.8*nrow(Boston), replace = F)
train = Boston[index,]
test = Boston[-index,]
Logistic regression model
glm.fit = glm(crim01 ~ . -crim, data = train, family = "binomial")
summary(glm.fit)
##
## Call:
## glm(formula = crim01 ~ . - crim, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7883 -0.1619 -0.0004 0.0034 3.5146
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.969510 7.953476 -5.151 2.59e-07 ***
## zn -0.104428 0.043515 -2.400 0.016404 *
## indus -0.016195 0.055057 -0.294 0.768642
## chas 0.293398 0.780450 0.376 0.706965
## nox 46.349769 8.784897 5.276 1.32e-07 ***
## rm 0.345454 0.815004 0.424 0.671662
## age 0.021583 0.013700 1.575 0.115165
## dis 0.637379 0.241450 2.640 0.008296 **
## rad 0.619212 0.182004 3.402 0.000668 ***
## tax -0.008471 0.003154 -2.686 0.007228 **
## ptratio 0.417660 0.142776 2.925 0.003441 **
## black -0.008409 0.005533 -1.520 0.128555
## lstat 0.110754 0.056591 1.957 0.050337 .
## medv 0.181387 0.079564 2.280 0.022622 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.02 on 403 degrees of freedom
## Residual deviance: 166.41 on 390 degrees of freedom
## AIC: 194.41
##
## Number of Fisher Scoring iterations: 9
glm.prob=predict(glm.fit, test, type = "response")
glm.pred = rep(0, nrow(test))
glm.pred[glm.prob > 0.5] = 1
table(glm.pred,test$crim01)
##
## glm.pred 0 1
## 0 40 5
## 1 9 48
mean(glm.pred==test$crim01)
## [1] 0.8627451
The logistic regression model achieves a solid test accuracy of 86.27%. The statistically significant predictors are zn, nox, dis, rad, tax, ptratio, and medv.
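Since the exercise asks for various subsets of the predictors, one natural follow-up (a sketch only; the names glm.sub and sub.pred are introduced here and no results are claimed) is to refit using just the significant predictors and compare test accuracy:

glm.sub = glm(crim01 ~ zn + nox + dis + rad + tax + ptratio + medv,
              data = train, family = "binomial")
sub.pred = ifelse(predict(glm.sub, test, type = "response") > 0.5, 1, 0)
mean(sub.pred == test$crim01)  # test accuracy of the reduced logistic model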
LDA Model
lda.fit = lda(crim01 ~ . -crim, data = train)
lda.fit
## Call:
## lda(crim01 ~ . - crim, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.5049505 0.4950495
##
## Group means:
## zn indus chas nox rm age dis rad
## 0 22.53431 6.831618 0.06372549 0.4699054 6.387426 50.94363 5.147072 4.22549
## 1 1.01000 15.332850 0.09000000 0.6361000 6.162905 86.59250 2.502883 14.79000
## tax ptratio black lstat medv
## 0 309.6765 17.87696 387.8906 9.304951 24.78627
## 1 508.6700 19.02200 332.6636 15.870800 20.19000
##
## Coefficients of linear discriminants:
## LD1
## zn -0.004875011
## indus 0.043980187
## chas -0.288736019
## nox 7.826186868
## rm 0.214578770
## age 0.013026384
## dis 0.074899641
## rad 0.085189428
## tax -0.002392775
## ptratio 0.078848030
## black -0.001093777
## lstat 0.026660233
## medv 0.044058366
lda.pred=predict(lda.fit, test)
lda.class=lda.pred$class
table(lda.class,test$crim01)
##
## lda.class 0 1
## 0 41 12
## 1 8 41
mean(lda.class==test$crim01)
## [1] 0.8039216
The LDA model has an accuracy of 80.39%, which falls short of the logistic regression model, so logistic regression remains the better of the two.
KNN Classification - With k = 1
train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[index,]
test.X <- cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[-index,]
train.crim01 <- Boston[index,c(15)]  # column 15 is the crim01 indicator created above
test.crim01 <- Boston[-index,c(15)]
knn.pred=knn(train.X,test.X,train.crim01,k=1)
table(knn.pred,test.crim01)
## test.crim01
## knn.pred 0 1
## 0 45 3
## 1 4 50
mean(knn.pred==test.crim01)
## [1] 0.9313725
KNN Classification - With k = 2
knn.pred=knn(train.X,test.X,train.crim01,k=2)
table(knn.pred,test.crim01)
## test.crim01
## knn.pred 0 1
## 0 45 5
## 1 4 48
mean(knn.pred==test.crim01)
## [1] 0.9117647
KNN Classification - With k = 20
knn.pred=knn(train.X,test.X,train.crim01,k=20)
table(knn.pred,test.crim01)
## test.crim01
## knn.pred 0 1
## 0 45 8
## 1 4 45
mean(knn.pred==test.crim01)
## [1] 0.8823529
The KNN classifier has an accuracy of 93.14% with k = 1, 91.18% with k = 2, and 88.24% with k = 20. Of all the models tried on the Boston data, KNN with k = 1 performs best, with the highest test accuracy of 93.14%.
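One caveat: KNN is distance-based and the Boston predictors are on very different scales (tax in the hundreds, nox below 1), so standardizing the predictors before running knn() is often worthwhile. A minimal sketch, assuming the same train/test index as above and making no claim about how it changes the results here:

X.scaled = scale(cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv))
knn.scaled = knn(X.scaled[index, ], X.scaled[-index, ], train.crim01, k = 1)
mean(knn.scaled == test.crim01)  # test accuracy with standardized predictors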