Assignment3

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.0.5

10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

attach(Weekly)
pairs(Weekly[,-9])

cor(Weekly[,-9])

##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

According to the scatterplot matrix, there is a trend between year and volume. Also according to the correlation matrix, the correlation coefficient between Year and Volume is 0.84 which implies that there is a significant correlation between these two variables.There are no significant relationships between the other variables

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

glm.weekly = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = "binomial")
summary(glm.weekly)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

The p-value of Lag 2 is 0.0296 which is less than the significant value of 0.05. Therefore, Lag 2 is statistically significant.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm.probs=predict(glm.weekly,type="response")
glm.pred=rep("Down",1089)
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction)

##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

glm.probs will contain the predicted probabilities from the fitted logistic regression model. glm.pred will contain the classified probabilities as “Up” for those > 0.5 and “Down” for the rest.

According to the above confusion matrix, 430 observations have been classified as “Up” when they are actually “Down”, and 48 observations have been classified as “Down” when they are actually “Up” (mistakes made by the logistic regression model).

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train=(Year>=1990 & Year<=2008) 
Weekly.train = Weekly[train,]
Weekly.test = Weekly[!train,]
glm.fits = glm(Direction ~ Lag2, data = Weekly.train, family = "binomial")
summary(glm.fits)

## 
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4

dim(Weekly.train)

## [1] 985   9

dim(Weekly.test)

## [1] 104   9

We split the data into train and test. The train data contains 985 observations which have Year from 1990 to 2008. The test data contains 104 observations which have Year 2009 and 2010. We fit the new model glm.fits using the test data.

Direction.test=Direction[!train]
glm.probs=predict(glm.fits,Weekly.test,type="response") #predict probabilities with the 2005 data (that is the test data)
glm.pred=rep("Down",104) #Replicate Down 252 times
glm.pred[glm.probs>.5]="Up" #If probs > 0.5, change it to Up
table(glm.pred,Direction.test)

##         Direction.test
## glm.pred Down Up
##     Down    9  5
##     Up     34 56

56/90

## [1] 0.6222222

The model accuracy is 62.22%

mean(glm.pred==Direction.test)

## [1] 0.625

We fit the logistic regression model glm.fits on the training data and compute the confusion matrix, there are 9 observations which are correctly classified as “Down” and 56 observations correctly classified as “Up”.

(e) Repeat (d) using LDA.

library(MASS)

lda.fits = lda(Direction ~ Lag2, data = Weekly.train)
lda.fits

## Call:
## lda(Direction ~ Lag2, data = Weekly.train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

lda.pred=predict(lda.fits, Weekly.test) 
lda.class=lda.pred$class

table(lda.class,Direction.test)

##          Direction.test
## lda.class Down Up
##      Down    9  5
##      Up     34 56

56/90

## [1] 0.6222222

The model accuracy is 62.22%

mean(lda.class==Direction.test)

## [1] 0.625

The confusion matrix for the model fit using LDA and logistic regression are the same.

(f) Repeat (d) using QDA.

qda.fit=qda(Direction~Lag2,data=Weekly.train)
qda.fit

## Call:
## qda(Direction ~ Lag2, data = Weekly.train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

qda.class=predict(qda.fit,Weekly.test)$class

table(qda.class,Direction.test)

##          Direction.test
## qda.class Down Up
##      Down    0  0
##      Up     43 61

61/104

## [1] 0.5865385

The accuracy score is 58.65% which is less than LDA (62.22%) and logistic regression (62.22%).

(g) Repeat (d) using KNN with K = 1.

library(class)

train.lag2 <-cbind(Weekly.train[,3])
test.lag2 <-cbind(Weekly.test[,3])

train.Direction <- cbind(Weekly.train[,9])
test.Direction <- cbind(Weekly.test[,9])


knn.pred=knn(train.lag2,test.lag2,train.Direction,k=1)
table(knn.pred,test.Direction)

##         test.Direction
## knn.pred  1  2
##        1 21 29
##        2 22 32

mean(knn.pred==test.Direction)

## [1] 0.5096154

(h) Which of these methods appears to provide the best results on this data?

According to the accuracy scores that can be calculated using the cofussion matrix, logistic regression and LDA models have the same accuracy score of 62.5%, QDA has a score of 58.65% and KNN has an accuracy score of 50.96%. Therefore, logistic regression and LDA provide the best result.

(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Logistic regression with square transformation of predictors

glm.sqfits = glm(Direction ~ Lag2+I(Lag2^2), data = Weekly.train, family = "binomial")
summary(glm.sqfits)

## 
## Call:
## glm(formula = Direction ~ Lag2 + I(Lag2^2), family = "binomial", 
##     data = Weekly.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.791  -1.253   1.005   1.100   1.196  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 0.179006   0.068357   2.619  0.00883 **
## Lag2        0.064920   0.029734   2.183  0.02901 * 
## I(Lag2^2)   0.004713   0.004569   1.031  0.30236   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1349.4  on 982  degrees of freedom
## AIC: 1355.4
## 
## Number of Fisher Scoring iterations: 4

glm.sqprobs = predict(glm.sqfits, Weekly.test, type = "response")
glm.pred = rep("Down", nrow(Weekly.test))
glm.pred[glm.sqprobs > 0.5] = "Up"

table(glm.pred, Direction.test)

##         Direction.test
## glm.pred Down Up
##     Down    8  4
##     Up     35 57

mean(glm.pred==Direction.test)

## [1] 0.625

Logistic regression with interaction between predictors

glm.intfits = glm(Direction ~ Lag1+Lag2+Lag1*Lag2, data = Weekly.train, family = "binomial")
summary(glm.intfits)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag1 * Lag2, family = "binomial", 
##     data = Weekly.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.573  -1.259   1.003   1.086   1.596  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.211419   0.064589   3.273  0.00106 **
## Lag1        -0.051505   0.030727  -1.676  0.09370 . 
## Lag2         0.053471   0.029193   1.832  0.06700 . 
## Lag1:Lag2    0.001921   0.007460   0.257  0.79680   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1346.9  on 981  degrees of freedom
## AIC: 1354.9
## 
## Number of Fisher Scoring iterations: 4

glm.intprobs = predict(glm.intfits, Weekly.test, type = "response")
glm.pred = rep("Down", nrow(Weekly.test))
glm.pred[glm.intprobs > 0.5] = "Up"

table(glm.pred, Direction.test)

##         Direction.test
## glm.pred Down Up
##     Down    7  8
##     Up     36 53

Logistic regression with log transformation of predictors

glm.logfits = glm(Direction ~ log10(abs(Lag2)), data = Weekly.train, family = "binomial")
summary(glm.logfits)

## 
## Call:
## glm(formula = Direction ~ log10(abs(Lag2)), family = "binomial", 
##     data = Weekly.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.305  -1.269   1.074   1.088   1.167  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)       0.20916    0.06410   3.263   0.0011 **
## log10(abs(Lag2))  0.06823    0.13149   0.519   0.6038   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1354.4  on 983  degrees of freedom
## AIC: 1358.4
## 
## Number of Fisher Scoring iterations: 3

glm.logprobs = predict(glm.logfits, Weekly.test, type = "response")
glm.pred = rep("Down", nrow(Weekly.test))
glm.pred[glm.logprobs > 0.5] = "Up"

table(glm.pred, Direction.test)

##         Direction.test
## glm.pred Down Up
##       Up   43 61

mean(glm.pred==Direction.test)

## [1] 0.5865385

Logistic regression with sqrt transformation of predictors

glm.sqrtfits = glm(Direction ~ sqrt(abs(Lag2)), data = Weekly.train, family = "binomial")
summary(glm.sqrtfits)

## 
## Call:
## glm(formula = Direction ~ sqrt(abs(Lag2)), family = "binomial", 
##     data = Weekly.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.405  -1.263   1.058   1.093   1.136  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)      0.09488    0.15028   0.631    0.528
## sqrt(abs(Lag2))  0.09961    0.11788   0.845    0.398
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1354.0  on 983  degrees of freedom
## AIC: 1358
## 
## Number of Fisher Scoring iterations: 3

glm.sqrtprobs = predict(glm.sqrtfits, Weekly.test, type = "response")
glm.pred = rep("Down", nrow(Weekly.test))
glm.pred[glm.sqrtprobs > 0.5] = "Up"

table(glm.pred, Direction.test)

##         Direction.test
## glm.pred Down Up
##       Up   43 61

 mean(glm.pred==Direction.test)

## [1] 0.5865385

LDA with log transformation of predictors

lda.fits = lda(Direction ~ log10(abs(Lag2)), data = Weekly.train)
summary(lda.fits)

##         Length Class  Mode     
## prior   2      -none- numeric  
## counts  2      -none- numeric  
## means   2      -none- numeric  
## scaling 1      -none- numeric  
## lev     2      -none- character
## svd     1      -none- numeric  
## N       1      -none- numeric  
## call    3      -none- call     
## terms   3      terms  call     
## xlevels 0      -none- list

lda.pred=predict(lda.fits, Weekly.test) 
lda.class=lda.pred$class

table(lda.class,Direction.test)

##          Direction.test
## lda.class Down Up
##      Down    0  0
##      Up     43 61

mean(lda.class==Direction.test)

## [1] 0.5865385

QDA with log transformation of predictors

qda.fit=qda(Direction~log10(abs(Lag2)),data=Weekly.train)
qda.fit

## Call:
## qda(Direction ~ log10(abs(Lag2)), data = Weekly.train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##      log10(abs(Lag2))
## Down      0.002797989
## Up        0.018983896

qda.class=predict(qda.fit,Weekly.test)$class

table(qda.class,Direction.test)

##          Direction.test
## qda.class Down Up
##      Down    0  0
##      Up     43 61

mean(qda.class==Direction.test)

## [1] 0.5865385

KNN with k=2

knn.pred=knn(train.lag2,test.lag2,train.Direction,k=2)
table(knn.pred,test.Direction)

##         test.Direction
## knn.pred  1  2
##        1 23 23
##        2 20 38

mean(knn.pred==test.Direction)

## [1] 0.5865385

KNN with k=3

knn.pred=knn(train.lag2,test.lag2,train.Direction,k=3)
table(knn.pred,test.Direction)

##         test.Direction
## knn.pred  1  2
##        1 16 20
##        2 27 41

mean(knn.pred==test.Direction)

## [1] 0.5480769

According to the above different transformations, logistic regression with square transformation of predictors seem to be the best model(Accuracy score 62.5%).KNN classification method with k=2 (with accuracy of 60.57%) is more accurate than K=3(with accuracy score of 54.8%). Logistic regression with interaction between predictors - Accuracy score 57.69%. Logistic regression with log transformation of predictors - Accuracy score 58.65%. Logistic regression with square root transformation of predictors - Accuracy score 58.65%. LDA with log transformation of predictors - Accuracy score 58.65%. QDA with log transformation of predictors - Accuracy score 58.65%

11. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set. (a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

#detach(Weekly)
Auto.data = Auto
attach(Auto.data)
Auto.data$mpg01 <- ifelse(mpg >= median(mpg), 1, 0)
#attach(Auto.data)

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

cor(Auto.data[, -9])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000

pairs(Auto.data[, -9])

From the above corrolation matrix, scaterplot matrix and boxplots, we can say that the variables cylinders, weight, displacement and horsepower seem to be useful in predicting mpg01.

(c) Split the data into a training set and a test set.

set.seed(123)
index = sample(nrow(Auto.data), 0.8*nrow(Auto.data), replace = F)
train = Auto.data[index,]
test = Auto.data[-index,]

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

lda.fit=lda(mpg01~cylinders+weight+displacement+horsepower,data=train)
lda.fit

## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4920128 0.5079872 
## 
## Group means:
##   cylinders   weight displacement horsepower
## 0  6.798701 3634.422      274.987  131.45455
## 1  4.163522 2323.302      114.283   78.45912
## 
## Coefficients of linear discriminants:
##                       LD1
## cylinders    -0.501316051
## weight       -0.001000777
## displacement -0.001033481
## horsepower    0.003881148

lda.pred=predict(lda.fit, test)
lda.class=lda.pred$class
table(lda.class,test$mpg01)

##          
## lda.class  0  1
##         0 35  4
##         1  7 33

mean(lda.class!=test$mpg01)

## [1] 0.1392405

The test error of the model or the misclassification rate is calculated to be 13.92%

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

qda.fit=qda(mpg01~cylinders+weight+displacement+horsepower,data=train)
qda.fit

## Call:
## qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4920128 0.5079872 
## 
## Group means:
##   cylinders   weight displacement horsepower
## 0  6.798701 3634.422      274.987  131.45455
## 1  4.163522 2323.302      114.283   78.45912

qda.pred=predict(qda.fit, test)
qda.class=qda.pred$class
table(qda.class,test$mpg01)

##          
## qda.class  0  1
##         0 37  4
##         1  5 33

mean(qda.class!=test$mpg01)

## [1] 0.1139241

The test error or the misclassification rate obtained from the QDA model is 11.39%

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

glm.fit=glm(mpg01~cylinders+weight+displacement+horsepower,data=train, family = "binomial")
summary(glm.fit)

## 
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.53768  -0.12766   0.07953   0.29411   3.14150  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  13.2775482  2.1223034   6.256 3.94e-10 ***
## cylinders     0.0868500  0.3902663   0.223  0.82389    
## weight       -0.0021961  0.0008331  -2.636  0.00838 ** 
## displacement -0.0151889  0.0093531  -1.624  0.10439    
## horsepower   -0.0514584  0.0168285  -3.058  0.00223 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 433.83  on 312  degrees of freedom
## Residual deviance: 149.26  on 308  degrees of freedom
## AIC: 159.26
## 
## Number of Fisher Scoring iterations: 7

glm.prob=predict(glm.fit, test, type = "response")
glm.pred = rep(0, nrow(test))
glm.pred[glm.prob > 0.5] = 1
table(glm.pred,test$mpg01)

##         
## glm.pred  0  1
##        0 35  5
##        1  7 32

mean(glm.pred!=test$mpg01)

## [1] 0.1518987

The test error or misclassification rate obtained from the logistic regression model is 15.18%

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

train.lag2 = cbind(cylinders,weight,displacement,horsepower)[index,]
test.lag2 <- cbind(cylinders,weight,displacement,horsepower)[-index,]

train.mpg01 <- Auto.data[index,c(10)]
test.mpg01 <- Auto.data[-index,c(10)]

knn.pred=knn(train.lag2,test.lag2,train.mpg01,k=1)
table(knn.pred,test.mpg01)

##         test.mpg01
## knn.pred  0  1
##        0 35  9
##        1  7 28

mean(knn.pred!=test.mpg01)

## [1] 0.2025316

The test error or misclassification rate obtained from the KNN with k=1 is 20.25%.

With k=2

knn.pred=knn(train.lag2,test.lag2,train.mpg01,k=2)
table(knn.pred,test.mpg01)

##         test.mpg01
## knn.pred  0  1
##        0 34  7
##        1  8 30

mean(knn.pred!=test.mpg01)

## [1] 0.1898734

The test error or misclassification rate obtained from the KNN with k=2 is 18.98%.

knn.pred=knn(train.lag2,test.lag2,train.mpg01,k=5)
table(knn.pred,test.mpg01)

##         test.mpg01
## knn.pred  0  1
##        0 35  5
##        1  7 32

mean(knn.pred!=test.mpg01)

## [1] 0.1518987

The test error or misclassification rate obtained from the KNN with k=5 is 15.18%.

K=1 performs a better prediction.

13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

Boston.data = Boston
attach(Boston.data)
Boston.data$crim01 <- ifelse(crim >= median(crim), 1, 0)

set.seed(123)
index = sample(nrow(Boston.data), 0.8*nrow(Boston.data), replace = F)
train = Boston.data[index,]
test = Boston.data[-index,]

Logistic regression model

glm.fit = glm(crim01 ~ . -crim, data = train, family = "binomial")
summary(glm.fit)

## 
## Call:
## glm(formula = crim01 ~ . - crim, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7883  -0.1619  -0.0004   0.0034   3.5146  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -40.969510   7.953476  -5.151 2.59e-07 ***
## zn           -0.104428   0.043515  -2.400 0.016404 *  
## indus        -0.016195   0.055057  -0.294 0.768642    
## chas          0.293398   0.780450   0.376 0.706965    
## nox          46.349769   8.784897   5.276 1.32e-07 ***
## rm            0.345454   0.815004   0.424 0.671662    
## age           0.021583   0.013700   1.575 0.115165    
## dis           0.637379   0.241450   2.640 0.008296 ** 
## rad           0.619212   0.182004   3.402 0.000668 ***
## tax          -0.008471   0.003154  -2.686 0.007228 ** 
## ptratio       0.417660   0.142776   2.925 0.003441 ** 
## black        -0.008409   0.005533  -1.520 0.128555    
## lstat         0.110754   0.056591   1.957 0.050337 .  
## medv          0.181387   0.079564   2.280 0.022622 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 560.02  on 403  degrees of freedom
## Residual deviance: 166.41  on 390  degrees of freedom
## AIC: 194.41
## 
## Number of Fisher Scoring iterations: 9

glm.prob=predict(glm.fit, test, type = "response")
glm.pred = rep(0, nrow(test))
glm.pred[glm.prob > 0.5] = 1
table(glm.pred,test$crim01)

##         
## glm.pred  0  1
##        0 40  5
##        1  9 48

mean(glm.pred==test$crim01)

## [1] 0.8627451

The logistic regression model accuracy score is 86.27%. The statistically significant predictors are zn, nox, dis, rad, tax, ptratio and medv.

LDA model

lda.fit = lda(crim01 ~ . -crim, data = train)
lda.fit

## Call:
## lda(crim01 ~ . - crim, data = train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5049505 0.4950495 
## 
## Group means:
##         zn     indus       chas       nox       rm      age      dis      rad
## 0 22.53431  6.831618 0.06372549 0.4699054 6.387426 50.94363 5.147072  4.22549
## 1  1.01000 15.332850 0.09000000 0.6361000 6.162905 86.59250 2.502883 14.79000
##        tax  ptratio    black     lstat     medv
## 0 309.6765 17.87696 387.8906  9.304951 24.78627
## 1 508.6700 19.02200 332.6636 15.870800 20.19000
## 
## Coefficients of linear discriminants:
##                  LD1
## zn      -0.004875011
## indus    0.043980187
## chas    -0.288736019
## nox      7.826186868
## rm       0.214578770
## age      0.013026384
## dis      0.074899641
## rad      0.085189428
## tax     -0.002392775
## ptratio  0.078848030
## black   -0.001093777
## lstat    0.026660233
## medv     0.044058366

lda.pred=predict(lda.fit, test)
lda.class=lda.pred$class
table(lda.class,test$crim01)

##          
## lda.class  0  1
##         0 41 12
##         1  8 41

mean(lda.class==test$crim01)

## [1] 0.8039216

The LDA model accuracy rate is 80.39%. The logistic regression model is the better one so far.

KNN classification

train.x = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[index,]
test.x <- cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[-index,]

train.crim01 <- Boston.data[index,c(15)]
test.crim01 <- Boston.data[-index,c(15)]

knn.pred=knn(train.x,test.x,train.crim01,k=1)
table(knn.pred,test.crim01)

##         test.crim01
## knn.pred  0  1
##        0 45  3
##        1  4 50

mean(knn.pred==test.crim01)

## [1] 0.9313725

knn.pred=knn(train.x,test.x,train.crim01,k=2)
table(knn.pred,test.crim01)

##         test.crim01
## knn.pred  0  1
##        0 45  5
##        1  4 48

mean(knn.pred==test.crim01)

## [1] 0.9117647

knn.pred=knn(train.x,test.x,train.crim01,k=5)
table(knn.pred,test.crim01)

##         test.crim01
## knn.pred  0  1
##        0 48  4
##        1  1 49

mean(knn.pred==test.crim01)

## [1] 0.9509804

The KNN classification has an accuracy rate of 93.13% when k=1, 90.19% when k=2 and 95.09% when k=5. The KNN classification model with k=1 is the best model with the highest accuracy rate of 93.13% amoung the models are performed.

Assignment3

Sara Shirinkam

6/29/2021

10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

The p-value of Lag 2 is 0.0296 which is less than the significant value of 0.05. Therefore, Lag 2 is statistically significant.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm.probs will contain the predicted probabilities from the fitted logistic regression model. glm.pred will contain the classified probabilities as “Up” for those > 0.5 and “Down” for the rest.

According to the above confusion matrix, 430 observations have been classified as “Up” when they are actually “Down”, and 48 observations have been classified as “Down” when they are actually “Up” (mistakes made by the logistic regression model).

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

We split the data into train and test. The train data contains 985 observations which have Year from 1990 to 2008. The test data contains 104 observations which have Year 2009 and 2010. We fit the new model glm.fits using the test data.

The model accuracy is 62.22%

We fit the logistic regression model glm.fits on the training data and compute the confusion matrix, there are 9 observations which are correctly classified as “Down” and 56 observations correctly classified as “Up”.

(e) Repeat (d) using LDA.

The model accuracy is 62.22%

The confusion matrix for the model fit using LDA and logistic regression are the same.

(f) Repeat (d) using QDA.

The accuracy score is 58.65% which is less than LDA (62.22%) and logistic regression (62.22%).

(g) Repeat (d) using KNN with K = 1.

(h) Which of these methods appears to provide the best results on this data?

According to the accuracy scores that can be calculated using the cofussion matrix, logistic regression and LDA models have the same accuracy score of 62.5%, QDA has a score of 58.65% and KNN has an accuracy score of 50.96%. Therefore, logistic regression and LDA provide the best result.

Logistic regression with square transformation of predictors

Logistic regression with interaction between predictors

Logistic regression with log transformation of predictors

Logistic regression with sqrt transformation of predictors

LDA with log transformation of predictors

QDA with log transformation of predictors

KNN with k=2

KNN with k=3

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

From the above corrolation matrix, scaterplot matrix and boxplots, we can say that the variables cylinders, weight, displacement and horsepower seem to be useful in predicting mpg01.

(c) Split the data into a training set and a test set.

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

The test error of the model or the misclassification rate is calculated to be 13.92%

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

The test error or the misclassification rate obtained from the QDA model is 11.39%

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

The test error or misclassification rate obtained from the logistic regression model is 15.18%

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

The test error or misclassification rate obtained from the KNN with k=1 is 20.25%.

With k=2

The test error or misclassification rate obtained from the KNN with k=2 is 18.98%.

The test error or misclassification rate obtained from the KNN with k=5 is 15.18%.

K=1 performs a better prediction.

13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

Logistic regression model

The logistic regression model accuracy score is 86.27%. The statistically significant predictors are zn, nox, dis, rad, tax, ptratio and medv.

LDA model

The LDA model accuracy rate is 80.39%. The logistic regression model is the better one so far.

KNN classification

The KNN classification has an accuracy rate of 93.13% when k=1, 90.19% when k=2 and 95.09% when k=5. The KNN classification model with k=1 is the best model with the highest accuracy rate of 93.13% amoung the models are performed.