Question 10

This question should be answered using the weekly data set which is a part of ISLR package and it contains 1089 weekly returns from the beginning of 1990 to the end of 2010

Question 10(a)

Produce some numerical and graphical summaries of the weekly data. Do there appear to be any patterns

names(Weekly)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
attach(Weekly)
head(Weekly)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly)

Based on the Correlation plot we cant find much information about the variables that are correlated with each other. So we will find out the correlation coefficient

cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

As per the Correlation coefficient,all the coefficients between current weeks return and previous week return is very less and it is close to zero.It implies that there is very little correlation between the current week return and previous weeks return. The only noticeable correlation is between volume and year

plot(Volume)

As per the above the Volume of the traded share on each week from 1990 to 2010 is on the upward direction only.

Question 10(b)

Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus volume as predictors. Use the summary function to predict the results. Do any of the predictors appear to be statistically significant? If so, which ones?

glm.weekly.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,family = binomial,data = Weekly)
summary(glm.weekly.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

As per the logistic regression model, the p- value for the Lag1 predictor is very small compared to the other predictors which implies that the lag1 predictor is significant predictor in this model. Also it has the positive coefficient which implies that if there is a positive return in the previous week, then there will be a upward movement in the current week stock market

Question 10(C)

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of markets made by logistic regression

glm.weekly.probs= predict(glm.weekly.fit,type = "response")
glm.weekly.pred = rep("Down", 1089)
glm.weekly.pred[glm.weekly.probs>0.5]="Up"
table(glm.weekly.pred,Direction)
##                Direction
## glm.weekly.pred Down  Up
##            Down   54  48
##            Up    430 557
mean(glm.weekly.pred==Direction) 
## [1] 0.5610652

As per the error rate , only 56.1% predictions are correct. The performance rate is very low . it is little bit higher than the guessing probability. So we can perform the other fitted models to find out the best fit model

Question 10(d)

Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor.Compute the confusion matrix and overall fraction of correct predictions for the held out data

train = (Weekly$Year>=1990&Weekly$Year<=2008)
Weekly.2009.greater = Weekly[!train,]
dim(Weekly.2009.greater)
## [1] 104   9
Direction.2009.greater = Direction[!train]
length(Direction.2009.greater)
## [1] 104
glm.weekly.fit1= glm(Direction~Lag2,family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit1)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
glm.weekly.probs1= predict(glm.weekly.fit1, Weekly.2009.greater, type="response")
glm.weekly.pred1=rep("Down",104)
glm.weekly.pred1[glm.weekly.probs1>0.5]="Up"
table(glm.weekly.pred1,Direction.2009.greater)
##                 Direction.2009.greater
## glm.weekly.pred1 Down Up
##             Down    9  5
##             Up     34 56
mean(glm.weekly.pred1 == Direction.2009.greater)
## [1] 0.625

When we fit the model only with the previous 2 weeks return , the performance is increased from 52.5% to 62.5% Compared to the full model , this model shows the better performance

Question 10 (e)

Repeat (d) using LDA

lda.fit = lda(Direction~Lag2,data = Weekly,subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
lda.pred= predict(lda.fit,Weekly.2009.greater)
table(lda.pred$class,Direction.2009.greater)
##       Direction.2009.greater
##        Down Up
##   Down    9  5
##   Up     34 56
mean(lda.pred$class==Direction.2009.greater)
## [1] 0.625

LDA performance rate is same as logistic regression performance rate

Question 10(f)

Repeat (d) using QDA

qda.fit=qda(Direction~Lag2,data = Weekly,subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda.pred = predict(qda.fit,Weekly.2009.greater)
qda.pred$class
##   [1] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
##  [26] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
##  [51] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
##  [76] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [101] Up Up Up Up
## Levels: Down Up
table(qda.pred$class,Direction.2009.greater)
##       Direction.2009.greater
##        Down Up
##   Down    0  0
##   Up     43 61
mean(qda.pred$class==Direction.2009.greater)
## [1] 0.5865385

Compared to LDA performance rate and logistic regression performance rate, QDA performance rate is little bit less, which makes the performance rate is more or less in close with the guess estimate

Question 10(g)

Repeat d using KNN with K=1

train.weekly.x = Weekly[train,3]
dim(train.weekly.x)=c(985,1)
test.weekly.x = Weekly[!train,3]
dim(test.weekly.x)=c(104,1)
train.weekly.direction = Direction[train]
set.seed(1)
knn.weekly.pred1=knn(train.weekly.x,test.weekly.x,train.weekly.direction,k=1)
table(knn.weekly.pred1,Direction.2009.greater)
##                 Direction.2009.greater
## knn.weekly.pred1 Down Up
##             Down   21 30
##             Up     22 31
mean(knn.weekly.pred1==Direction.2009.greater)
## [1] 0.5

Compared to LDA performance rate,logistic regression performance rate and QDA performance rate, KNN performance rate is less which makes the performance rate in close with the guess estimate so this model is not a best fit model.

Question 10(h)

Which of these methods appear to be provide the best results on this data

By comparing all the performance rate for logistic regression, LDA, QDA and KNN with K-1 for the model which contains the response variable as Direction and predictor as Lag2, the Logistic and LDA provides the best performance rate of 62.5% .So I am concluding here that logistic regression for this model will yield the best results

Question 10(I)

Experiment with different combinations of predictors, including possible transformations and interactions for each of the methods. Report the variables and associated confusion matrix that appears to provide best results on the held out data

Try with Higher K values for the model with Lag2

set.seed(1)
knn.weekly.pred2=knn(train.weekly.x,test.weekly.x,train.weekly.direction,k=5)
table(knn.weekly.pred2,Direction.2009.greater)
##                 Direction.2009.greater
## knn.weekly.pred2 Down Up
##             Down   16 21
##             Up     27 40
mean(knn.weekly.pred2==Direction.2009.greater)
## [1] 0.5384615
set.seed(1)
knn.weekly.pred3=knn(train.weekly.x,test.weekly.x,train.weekly.direction,k=9)
table(knn.weekly.pred3,Direction.2009.greater)
##                 Direction.2009.greater
## knn.weekly.pred3 Down Up
##             Down   17 20
##             Up     26 41
mean(knn.weekly.pred3==Direction.2009.greater)
## [1] 0.5576923

Increasing the K values doesn’t so much improvement in the performance rate, so I am trying with the Different models

Model with the Lag1 and Lag2

glm.weekly.fit2= glm(Direction~Lag1+Lag2,family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit2)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6149  -1.2565   0.9989   1.0875   1.5330  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.21109    0.06456   3.269  0.00108 **
## Lag1        -0.05421    0.02886  -1.878  0.06034 . 
## Lag2         0.05384    0.02905   1.854  0.06379 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1347.0  on 982  degrees of freedom
## AIC: 1353
## 
## Number of Fisher Scoring iterations: 4
glm.weekly.probs2= predict(glm.weekly.fit2, Weekly.2009.greater, type="response")
glm.weekly.pred2=rep("Down",104)
glm.weekly.pred2[glm.weekly.probs2>0.5]="Up"
table(glm.weekly.pred2,Direction.2009.greater)
##                 Direction.2009.greater
## glm.weekly.pred2 Down Up
##             Down    7  8
##             Up     36 53
mean(glm.weekly.pred2==Direction.2009.greater)
## [1] 0.5769231

Model with the all the Lag parameters

glm.weekly.fit3= glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5,family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit3)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, family = binomial, 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8534  -1.2491   0.9941   1.0886   1.5126  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.220141   0.065069   3.383 0.000716 ***
## Lag1        -0.055438   0.029051  -1.908 0.056357 .  
## Lag2         0.053290   0.029348   1.816 0.069401 .  
## Lag3        -0.009292   0.029265  -0.318 0.750859    
## Lag4        -0.024484   0.028970  -0.845 0.398036    
## Lag5        -0.031544   0.029005  -1.088 0.276796    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1345.1  on 979  degrees of freedom
## AIC: 1357.1
## 
## Number of Fisher Scoring iterations: 4
glm.weekly.probs3= predict(glm.weekly.fit3, Weekly.2009.greater, type="response")
glm.weekly.pred3=rep("Down",104)
glm.weekly.pred3[glm.weekly.probs3>0.5]="Up"
table(glm.weekly.pred3,Direction.2009.greater)
##                 Direction.2009.greater
## glm.weekly.pred3 Down Up
##             Down   10 14
##             Up     33 47
mean(glm.weekly.pred3==Direction.2009.greater)
## [1] 0.5480769

Model with the Lag1

glm.weekly.fit4= glm(Direction~Lag1,family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit4)
## 
## Call:
## glm(formula = Direction ~ Lag1, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.519  -1.253   1.028   1.094   1.281  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.21829    0.06438   3.391 0.000697 ***
## Lag1        -0.05908    0.02892  -2.043 0.041059 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.4  on 983  degrees of freedom
## AIC: 1354.4
## 
## Number of Fisher Scoring iterations: 4
glm.weekly.probs4= predict(glm.weekly.fit4, Weekly.2009.greater, type="response")
glm.weekly.pred4=rep("Down",104)
glm.weekly.pred4[glm.weekly.probs4>0.5]="Up"
table(glm.weekly.pred4,Direction.2009.greater)
##                 Direction.2009.greater
## glm.weekly.pred4 Down Up
##             Down    4  6
##             Up     39 55
mean(glm.weekly.pred4==Direction.2009.greater)
## [1] 0.5673077

Model with Lag1 and Lag1^2

glm.weekly.fit5= glm(Direction~Lag1+I(Lag1^2),family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit5)
## 
## Call:
## glm(formula = Direction ~ Lag1 + I(Lag1^2), family = binomial, 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.529  -1.253   1.029   1.094   1.273  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.2166882  0.0687066   3.154  0.00161 **
## Lag1        -0.0588584  0.0291346  -2.020  0.04336 * 
## I(Lag1^2)    0.0003167  0.0047503   0.067  0.94684   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.4  on 982  degrees of freedom
## AIC: 1356.4
## 
## Number of Fisher Scoring iterations: 4
glm.weekly.probs5= predict(glm.weekly.fit5, Weekly.2009.greater, type="response")
glm.weekly.pred5=rep("Down",104)
glm.weekly.pred5[glm.weekly.probs5>0.5]="Up"
table(glm.weekly.pred5,Direction.2009.greater)
##                 Direction.2009.greater
## glm.weekly.pred5 Down Up
##             Down    4  5
##             Up     39 56
mean(glm.weekly.pred5==Direction.2009.greater)
## [1] 0.5769231

Model with Lag2 and Lag2^2

glm.weekly.fit6= glm(Direction~Lag2+I(Lag2^2),family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit6)
## 
## Call:
## glm(formula = Direction ~ Lag2 + I(Lag2^2), family = binomial, 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.791  -1.253   1.005   1.100   1.196  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 0.179006   0.068357   2.619  0.00883 **
## Lag2        0.064920   0.029734   2.183  0.02901 * 
## I(Lag2^2)   0.004713   0.004569   1.031  0.30236   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1349.4  on 982  degrees of freedom
## AIC: 1355.4
## 
## Number of Fisher Scoring iterations: 4
glm.weekly.probs6= predict(glm.weekly.fit6, Weekly.2009.greater, type="response")
glm.weekly.pred6=rep("Down",104)
glm.weekly.pred6[glm.weekly.probs6>0.5]="Up"
table(glm.weekly.pred6,Direction.2009.greater)
##                 Direction.2009.greater
## glm.weekly.pred6 Down Up
##             Down    8  4
##             Up     35 57
mean(glm.weekly.pred6==Direction.2009.greater)
## [1] 0.625

Model with Lag1 and Lag2^2

glm.weekly.fit7= glm(Direction~Lag1+I(Lag2^2),family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit7)
## 
## Call:
## glm(formula = Direction ~ Lag1 + I(Lag2^2), family = binomial, 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.602  -1.251   1.025   1.094   1.275  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.202658   0.068161   2.973  0.00295 **
## Lag1        -0.060183   0.029104  -2.068  0.03865 * 
## I(Lag2^2)    0.003120   0.004518   0.691  0.48982   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1349.9  on 982  degrees of freedom
## AIC: 1355.9
## 
## Number of Fisher Scoring iterations: 4
glm.weekly.probs7= predict(glm.weekly.fit7, Weekly.2009.greater, type="response")
glm.weekly.pred7=rep("Down",104)
glm.weekly.pred7[glm.weekly.probs7>0.5]="Up"
table(glm.weekly.pred7,Direction.2009.greater)
##                 Direction.2009.greater
## glm.weekly.pred7 Down Up
##             Down    6  6
##             Up     37 55
mean(glm.weekly.pred7==Direction.2009.greater)
## [1] 0.5865385

Model with Lag2 and Lag1^2

glm.weekly.fit8= glm(Direction~Lag2+I(Lag1^2),family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit8)
## 
## Call:
## glm(formula = Direction ~ Lag2 + I(Lag1^2), family = binomial, 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.587  -1.259   1.015   1.093   1.338  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 0.181618   0.068622   2.647  0.00813 **
## Lag2        0.065085   0.029785   2.185  0.02888 * 
## I(Lag1^2)   0.004079   0.004583   0.890  0.37339   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1349.7  on 982  degrees of freedom
## AIC: 1355.7
## 
## Number of Fisher Scoring iterations: 4
glm.weekly.probs8= predict(glm.weekly.fit8, Weekly.2009.greater, type="response")
glm.weekly.pred8=rep("Down",104)
glm.weekly.pred8[glm.weekly.probs8>0.5]="Up"
table(glm.weekly.pred8,Direction.2009.greater)
##                 Direction.2009.greater
## glm.weekly.pred8 Down Up
##             Down    8  2
##             Up     35 59
mean(glm.weekly.pred8==Direction.2009.greater)
## [1] 0.6442308

Out of the above all models, Model with Lag2 and Lag1^2 shows the better performance rate of predicting the correct stock market direction.Out of 104 day 64 days are predicted correctly by this model

detach(Weekly)

Question 11

In this problem you will develop a model to predict within whether a given car gets high or low gas mileage based on the Auto Data set

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
attach(Auto)
## The following object is masked from package:ggplot2:
## 
##     mpg
dim(Auto)
## [1] 392   9

Question 11(a)

Create a binary variable mpg01 that contains a 1 if mpg contains a value above its median and a 0 if mpg contains a value below its median.You can compute the median using median() function.

mpg01 = rep(0, 392)
mpg01[mpg>median(mpg)]=1
df.auto = data.frame(Auto,mpg01)
View(df.auto)

Question 11(b)

Explore the data graphically in order to investigate the association between mpg01 and other features. which of the other features are most likely to be useful in predicting mpg01? scatter plots and box plots may be useful tools to answer this question.Describe your findings

pairs(df.auto)

names(df.auto)
##  [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
##  [6] "acceleration" "year"         "origin"       "name"         "mpg01"
cor(df.auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000

As per the scatter plots and correlation coefficient mpg01 is highly correlated in positive direction with mpg and negative correlated with weight, displacement horsepower and cylinders.Also Mpg is used to derive the value of mpg01 . so I am using the other variables such as weight, displacement, cylinders,and horsepower to predict the best fit model

Question 11(c)

Split the data into training set and test set

set.seed(123)
trainIndex = createDataPartition(df.auto$mpg01,p=0.8,list=FALSE)
df.auto.train = df.auto[trainIndex,]
df.auto.test = df.auto[-trainIndex,]
mpg01.train=df.auto[trainIndex,10]
mpg01.test= factor(df.auto[-trainIndex,10])

Question 11(d)

Perform LDA on the training data in order to predict mpg01 using the variables that seemed to be more associated with mpg01 in b. What is the test error of the model obtained

detach(Auto)
attach(df.auto)
## The following object is masked _by_ .GlobalEnv:
## 
##     mpg01
## The following object is masked from package:ggplot2:
## 
##     mpg
lda.auto.fit = lda(mpg01~weight+displacement+cylinders+horsepower,data = df.auto.train)
lda.auto.fit
## Call:
## lda(mpg01 ~ weight + displacement + cylinders + horsepower, data = df.auto.train)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##     weight displacement cylinders horsepower
## 0 3606.701     275.4968  6.808917  130.59873
## 1 2345.070     116.4108  4.197452   79.09554
## 
## Coefficients of linear discriminants:
##                        LD1
## weight       -0.0008475776
## displacement -0.0025319858
## cylinders    -0.5010010916
## horsepower    0.0049331691
lda.auto.pred= predict(lda.auto.fit,df.auto.test)
names(lda.auto.pred)
## [1] "class"     "posterior" "x"
table(lda.auto.pred$class,mpg01.test)
##    mpg01.test
##      0  1
##   0 31  1
##   1  8 38
mean(lda.auto.pred$class==mpg01.test)
## [1] 0.8846154
mean(lda.auto.pred$class!=mpg01.test)
## [1] 0.1153846

Test Error Rate for Linear Discriminant Analysis method is 11.5%

Question 11(e)

Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in b.what is the test error of the model obtained

qda.auto.fit = qda(mpg01~weight+displacement+cylinders+horsepower,data = df.auto.train)
qda.auto.fit
## Call:
## qda(mpg01 ~ weight + displacement + cylinders + horsepower, data = df.auto.train)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##     weight displacement cylinders horsepower
## 0 3606.701     275.4968  6.808917  130.59873
## 1 2345.070     116.4108  4.197452   79.09554
qda.auto.pred= predict(qda.auto.fit,df.auto.test)
table(qda.auto.pred$class,mpg01.test)
##    mpg01.test
##      0  1
##   0 32  3
##   1  7 36
mean(qda.auto.pred$class==mpg01.test)
## [1] 0.8717949
mean(qda.auto.pred$class!=mpg01.test)
## [1] 0.1282051

Test Error Rate for Quadratic Discriminant Analysis method id 12.8%

Question 11(f)

Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in b.what is the test error of the model obtained

glm.auto.fit = glm(mpg01~weight+displacement+cylinders+horsepower,family = binomial,data = df.auto.train)
summary(glm.auto.fit)
## 
## Call:
## glm(formula = mpg01 ~ weight + displacement + cylinders + horsepower, 
##     family = binomial, data = df.auto.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2318  -0.1773   0.0487   0.3398   3.4044  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  11.4010959  1.8995270   6.002 1.95e-09 ***
## weight       -0.0016414  0.0008099  -2.027  0.04271 *  
## displacement -0.0158997  0.0090122  -1.764  0.07769 .  
## cylinders     0.0011353  0.3834886   0.003  0.99764    
## horsepower   -0.0418884  0.0150907  -2.776  0.00551 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 435.30  on 313  degrees of freedom
## Residual deviance: 163.19  on 309  degrees of freedom
## AIC: 173.19
## 
## Number of Fisher Scoring iterations: 7
glm.auto.probs = predict(glm.auto.fit,df.auto.test,type = "response")
glm.auto.pred= rep(0,78)
glm.auto.pred[glm.auto.probs>0.5]=1
table(glm.auto.pred,mpg01.test)
##              mpg01.test
## glm.auto.pred  0  1
##             0 31  2
##             1  8 37
mean(glm.auto.pred==mpg01.test)
## [1] 0.8717949
mean(glm.auto.pred!=mpg01.test)
## [1] 0.1282051

Test Error Rate is 12.8% for logistic regression method

Question 11(g)

perform KNN on the training data, with several values of K in order to predict mpg01.use only the variables that seemed most associated with mpg01 in (b). what test errors do you obtain?which value of K seems to perform the best fit on this data

train2_matrix = data.matrix(df.auto.train[,c("cylinders","displacement","weight","horsepower")])
test2_matrix = data.matrix(df.auto.test[,c("cylinders","displacement","weight","horsepower")])
train2_y = data.matrix(df.auto.train$mpg01)
set.seed(1)
knn.auto.pred=knn(train2_matrix,test2_matrix,train2_y,k=1)
table(knn.auto.pred,mpg01.test)
##              mpg01.test
## knn.auto.pred  0  1
##             0 30  2
##             1  9 37
mean(knn.auto.pred==mpg01.test)
## [1] 0.8589744
mean(knn.auto.pred!=mpg01.test)
## [1] 0.1410256

Test error When K=1 in KNN method is 14.1%

When K=3

set.seed(1)
knn.auto.pred1=knn(train2_matrix,test2_matrix,train2_y,k=3)
table(knn.auto.pred1,mpg01.test)
##               mpg01.test
## knn.auto.pred1  0  1
##              0 30  0
##              1  9 39
mean(knn.auto.pred1==mpg01.test)
## [1] 0.8846154
mean(knn.auto.pred1!=mpg01.test)
## [1] 0.1153846

When K=5

set.seed(1)
knn.auto.pred2=knn(train2_matrix,test2_matrix,train2_y,k=5)
table(knn.auto.pred2,mpg01.test)
##               mpg01.test
## knn.auto.pred2  0  1
##              0 31  3
##              1  8 36
mean(knn.auto.pred2==mpg01.test)
## [1] 0.8589744
mean(knn.auto.pred2!=mpg01.test)
## [1] 0.1410256

When K=7

set.seed(1)
knn.auto.pred3=knn(train2_matrix,test2_matrix,train2_y,k=7)
table(knn.auto.pred3,mpg01.test)
##               mpg01.test
## knn.auto.pred3  0  1
##              0 31  1
##              1  8 38
mean(knn.auto.pred3==mpg01.test)
## [1] 0.8846154
mean(knn.auto.pred3!=mpg01.test)
## [1] 0.1153846

As per the above K values,K=3 and K-7 provides the small test error rate and high performance rate for the KNN method for the model with the variables weight, cylinders, displacement and horsepower in order to predict mpg01

Question 13

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression , LDA, KNN models using various subset of parameters. Describe your findings

names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
dim(Boston)
## [1] 506  14
attach(Boston)
crime01 = rep(0, 506)
crime01[crim>median(crim)]=1
df.boston = data.frame(Boston,crime01)
str(df.boston)
## 'data.frame':    506 obs. of  15 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##  $ crime01: num  0 0 0 0 0 0 0 0 0 0 ...
pairs(df.boston)

cor(df.boston$crime01,df.boston)
##           crim        zn     indus       chas       nox         rm       age
## [1,] 0.4093955 -0.436151 0.6032602 0.07009677 0.7232348 -0.1563718 0.6139399
##             dis       rad       tax   ptratio      black     lstat       medv
## [1,] -0.6163416 0.6197862 0.6087413 0.2535684 -0.3512109 0.4532627 -0.2630167
##      crime01
## [1,]       1

Split the train and test Dataset

set.seed(123)
trainIndex = createDataPartition(df.boston$crime01,p=0.8,list=FALSE)
df.boston.train = df.boston[trainIndex,]
df.boston.test = df.boston[-trainIndex,]
crime01.test= factor(df.boston[-trainIndex,15])
detach(Boston)
attach(df.boston)
## The following object is masked _by_ .GlobalEnv:
## 
##     crime01

Logistic Regression

glm.boston.fit = glm(crime01~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,family = binomial,data = df.boston.train)
summary(glm.boston.fit)
## 
## Call:
## glm(formula = crime01 ~ zn + indus + chas + nox + rm + age + 
##     dis + rad + tax + ptratio + black + lstat + medv, family = binomial, 
##     data = df.boston.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4149  -0.2022  -0.0004   0.0030   3.5210  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -35.258213   7.435012  -4.742 2.11e-06 ***
## zn           -0.079313   0.038839  -2.042 0.041141 *  
## indus        -0.040156   0.049127  -0.817 0.413700    
## chas          0.841962   0.845047   0.996 0.319080    
## nox          48.999078   8.538917   5.738 9.56e-09 ***
## rm           -0.290772   0.780705  -0.372 0.709559    
## age           0.032608   0.013783   2.366 0.017990 *  
## dis           0.796546   0.246093   3.237 0.001209 ** 
## rad           0.632139   0.171349   3.689 0.000225 ***
## tax          -0.007036   0.002989  -2.354 0.018568 *  
## ptratio       0.371711   0.136979   2.714 0.006655 ** 
## black        -0.013372   0.006539  -2.045 0.040858 *  
## lstat        -0.020232   0.055991  -0.361 0.717845    
## medv          0.152260   0.075747   2.010 0.044420 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 562.84  on 405  degrees of freedom
## Residual deviance: 173.89  on 392  degrees of freedom
## AIC: 201.89
## 
## Number of Fisher Scoring iterations: 9

As per the logistic regression coefficients and their values the predictors such as “zn”,“nox”,“age”,“dis”,“rad”,“tax”, “pratio”,“black”,“medv” plays a significant role in predicting the crime rate in a specific town in boston

glm.boston.probs = predict(glm.boston.fit,df.boston.test,type = "response")
glm.boston.pred= rep(0,100)
glm.boston.pred[glm.boston.probs>0.5]=1
table(glm.boston.pred,crime01.test)
##                crime01.test
## glm.boston.pred  0  1
##               0 46  4
##               1  4 46
mean(glm.boston.pred==crime01.test)
## [1] 0.92
mean(glm.boston.pred!=crime01.test)
## [1] 0.08

Test Error Rate for logistic regression method of classification is 8% for predicting the crime rate in Boston using all the predictors in Boston data set

LDA Method of classification

lda.boston.fit = lda(crime01~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,data = df.boston.train)
lda.boston.fit
## Call:
## lda(crime01 ~ zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat + medv, data = df.boston.train)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##          zn     indus       chas       nox       rm      age      dis      rad
## 0 20.756158  7.135172 0.05418719 0.4733837 6.381084 51.76207 4.964841  4.20197
## 1  1.103448 15.360345 0.07881773 0.6390197 6.129340 85.69704 2.518060 14.91626
##        tax  ptratio    black     lstat     medv
## 0 309.4335 17.97389 388.3729  9.521133 24.79409
## 1 511.4877 19.06552 320.4518 16.007931 19.84926
## 
## Coefficients of linear discriminants:
##                  LD1
## zn      -0.006348178
## indus    0.022904003
## chas    -0.098102870
## nox      8.203944699
## rm       0.116486695
## age      0.012924549
## dis      0.090133261
## rad      0.082664716
## tax     -0.001463969
## ptratio  0.051083556
## black   -0.001219957
## lstat    0.004246550
## medv     0.033644473
lda.boston.pred= predict(lda.boston.fit,df.boston.test)
names(lda.boston.pred)
## [1] "class"     "posterior" "x"
table(lda.boston.pred$class,crime01.test)
##    crime01.test
##      0  1
##   0 47 12
##   1  3 38
mean(lda.boston.pred$class==crime01.test)
## [1] 0.85
mean(lda.boston.pred$class!=crime01.test)
## [1] 0.15

Test Error Rate for LDA method of classification is 15%

QDA Method of classification

qda.boston.fit = qda(crime01~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,data = df.boston.train)
qda.boston.fit
## Call:
## qda(crime01 ~ zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat + medv, data = df.boston.train)
## 
## Prior probabilities of groups:
##   0   1 
## 0.5 0.5 
## 
## Group means:
##          zn     indus       chas       nox       rm      age      dis      rad
## 0 20.756158  7.135172 0.05418719 0.4733837 6.381084 51.76207 4.964841  4.20197
## 1  1.103448 15.360345 0.07881773 0.6390197 6.129340 85.69704 2.518060 14.91626
##        tax  ptratio    black     lstat     medv
## 0 309.4335 17.97389 388.3729  9.521133 24.79409
## 1 511.4877 19.06552 320.4518 16.007931 19.84926
qda.boston.pred= predict(qda.boston.fit,df.boston.test)
names(qda.boston.pred)
## [1] "class"     "posterior"
table(qda.boston.pred$class,crime01.test)
##    crime01.test
##      0  1
##   0 50  7
##   1  0 43
mean(qda.boston.pred$class==crime01.test)
## [1] 0.93
mean(qda.boston.pred$class!=crime01.test)
## [1] 0.07

Test Error for QDA method of classification is 7%

KNN method of Classification

train2_boston_matrix = data.matrix(df.boston.train[,c("zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","black","lstat","medv")])
test2_boston_matrix = data.matrix(df.boston.test[,c("zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","black","lstat","medv")])
train2_boston_y = data.matrix(df.boston.train$crime01)
set.seed(1)
knn.boston.pred=knn(train2_boston_matrix,test2_boston_matrix,train2_boston_y,k=1)
table(knn.boston.pred,crime01.test)
##                crime01.test
## knn.boston.pred  0  1
##               0 49  6
##               1  1 44
mean(knn.boston.pred==crime01.test)
## [1] 0.93
mean(knn.boston.pred!=crime01.test)
## [1] 0.07

Test error When K=1 in KNN method is 7%

When K=3

set.seed(1)
knn.boston.pred1=knn(train2_boston_matrix,test2_boston_matrix,train2_boston_y,k=3)
table(knn.boston.pred1,crime01.test)
##                 crime01.test
## knn.boston.pred1  0  1
##                0 46  5
##                1  4 45
mean(knn.boston.pred1==crime01.test)
## [1] 0.91
mean(knn.boston.pred1!=crime01.test)
## [1] 0.09

Test error When K=3 in KNN method is 9%

When K=7

set.seed(1)
knn.boston.pred2=knn(train2_boston_matrix,test2_boston_matrix,train2_boston_y,k=7)
table(knn.boston.pred2,crime01.test)
##                 crime01.test
## knn.boston.pred2  0  1
##                0 48  5
##                1  2 45
mean(knn.boston.pred2==crime01.test)
## [1] 0.93
mean(knn.boston.pred2!=crime01.test)
## [1] 0.07

Test error When K=7 in KNN method is 7%

As per the above analysis QDA method classifies the Boston town as crime town or not relatively correct close to 93% based on all the predictors available in the Boston data set