This question should be answered using the weekly data set which is a part of ISLR package and it contains 1089 weekly returns from the beginning of 1990 to the end of 2010
Produce some numerical and graphical summaries of the weekly data. Do there appear to be any patterns
names(Weekly)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
attach(Weekly)
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly)
Based on the Correlation plot we cant find much information about the variables that are correlated with each other. So we will find out the correlation coefficient
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
As per the Correlation coefficient,all the coefficients between current weeks return and previous week return is very less and it is close to zero.It implies that there is very little correlation between the current week return and previous weeks return. The only noticeable correlation is between volume and year
plot(Volume)
As per the above the Volume of the traded share on each week from 1990 to 2010 is on the upward direction only.
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus volume as predictors. Use the summary function to predict the results. Do any of the predictors appear to be statistically significant? If so, which ones?
glm.weekly.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,family = binomial,data = Weekly)
summary(glm.weekly.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
As per the logistic regression model, the p- value for the Lag1 predictor is very small compared to the other predictors which implies that the lag1 predictor is significant predictor in this model. Also it has the positive coefficient which implies that if there is a positive return in the previous week, then there will be a upward movement in the current week stock market
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of markets made by logistic regression
glm.weekly.probs= predict(glm.weekly.fit,type = "response")
glm.weekly.pred = rep("Down", 1089)
glm.weekly.pred[glm.weekly.probs>0.5]="Up"
table(glm.weekly.pred,Direction)
## Direction
## glm.weekly.pred Down Up
## Down 54 48
## Up 430 557
mean(glm.weekly.pred==Direction)
## [1] 0.5610652
As per the error rate , only 56.1% predictions are correct. The performance rate is very low . it is little bit higher than the guessing probability. So we can perform the other fitted models to find out the best fit model
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor.Compute the confusion matrix and overall fraction of correct predictions for the held out data
train = (Weekly$Year>=1990&Weekly$Year<=2008)
Weekly.2009.greater = Weekly[!train,]
dim(Weekly.2009.greater)
## [1] 104 9
Direction.2009.greater = Direction[!train]
length(Direction.2009.greater)
## [1] 104
glm.weekly.fit1= glm(Direction~Lag2,family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit1)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
glm.weekly.probs1= predict(glm.weekly.fit1, Weekly.2009.greater, type="response")
glm.weekly.pred1=rep("Down",104)
glm.weekly.pred1[glm.weekly.probs1>0.5]="Up"
table(glm.weekly.pred1,Direction.2009.greater)
## Direction.2009.greater
## glm.weekly.pred1 Down Up
## Down 9 5
## Up 34 56
mean(glm.weekly.pred1 == Direction.2009.greater)
## [1] 0.625
When we fit the model only with the previous 2 weeks return , the performance is increased from 52.5% to 62.5% Compared to the full model , this model shows the better performance
Repeat (d) using LDA
lda.fit = lda(Direction~Lag2,data = Weekly,subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda.pred= predict(lda.fit,Weekly.2009.greater)
table(lda.pred$class,Direction.2009.greater)
## Direction.2009.greater
## Down Up
## Down 9 5
## Up 34 56
mean(lda.pred$class==Direction.2009.greater)
## [1] 0.625
LDA performance rate is same as logistic regression performance rate
Repeat (d) using QDA
qda.fit=qda(Direction~Lag2,data = Weekly,subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.pred = predict(qda.fit,Weekly.2009.greater)
qda.pred$class
## [1] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [26] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [51] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [76] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [101] Up Up Up Up
## Levels: Down Up
table(qda.pred$class,Direction.2009.greater)
## Direction.2009.greater
## Down Up
## Down 0 0
## Up 43 61
mean(qda.pred$class==Direction.2009.greater)
## [1] 0.5865385
Compared to LDA performance rate and logistic regression performance rate, QDA performance rate is little bit less, which makes the performance rate is more or less in close with the guess estimate
Repeat d using KNN with K=1
train.weekly.x = Weekly[train,3]
dim(train.weekly.x)=c(985,1)
test.weekly.x = Weekly[!train,3]
dim(test.weekly.x)=c(104,1)
train.weekly.direction = Direction[train]
set.seed(1)
knn.weekly.pred1=knn(train.weekly.x,test.weekly.x,train.weekly.direction,k=1)
table(knn.weekly.pred1,Direction.2009.greater)
## Direction.2009.greater
## knn.weekly.pred1 Down Up
## Down 21 30
## Up 22 31
mean(knn.weekly.pred1==Direction.2009.greater)
## [1] 0.5
Compared to LDA performance rate,logistic regression performance rate and QDA performance rate, KNN performance rate is less which makes the performance rate in close with the guess estimate so this model is not a best fit model.
Which of these methods appear to be provide the best results on this data
By comparing all the performance rate for logistic regression, LDA, QDA and KNN with K-1 for the model which contains the response variable as Direction and predictor as Lag2, the Logistic and LDA provides the best performance rate of 62.5% .So I am concluding here that logistic regression for this model will yield the best results
Experiment with different combinations of predictors, including possible transformations and interactions for each of the methods. Report the variables and associated confusion matrix that appears to provide best results on the held out data
set.seed(1)
knn.weekly.pred2=knn(train.weekly.x,test.weekly.x,train.weekly.direction,k=5)
table(knn.weekly.pred2,Direction.2009.greater)
## Direction.2009.greater
## knn.weekly.pred2 Down Up
## Down 16 21
## Up 27 40
mean(knn.weekly.pred2==Direction.2009.greater)
## [1] 0.5384615
set.seed(1)
knn.weekly.pred3=knn(train.weekly.x,test.weekly.x,train.weekly.direction,k=9)
table(knn.weekly.pred3,Direction.2009.greater)
## Direction.2009.greater
## knn.weekly.pred3 Down Up
## Down 17 20
## Up 26 41
mean(knn.weekly.pred3==Direction.2009.greater)
## [1] 0.5576923
Increasing the K values doesn’t so much improvement in the performance rate, so I am trying with the Different models
glm.weekly.fit2= glm(Direction~Lag1+Lag2,family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit2)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6149 -1.2565 0.9989 1.0875 1.5330
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.21109 0.06456 3.269 0.00108 **
## Lag1 -0.05421 0.02886 -1.878 0.06034 .
## Lag2 0.05384 0.02905 1.854 0.06379 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1347.0 on 982 degrees of freedom
## AIC: 1353
##
## Number of Fisher Scoring iterations: 4
glm.weekly.probs2= predict(glm.weekly.fit2, Weekly.2009.greater, type="response")
glm.weekly.pred2=rep("Down",104)
glm.weekly.pred2[glm.weekly.probs2>0.5]="Up"
table(glm.weekly.pred2,Direction.2009.greater)
## Direction.2009.greater
## glm.weekly.pred2 Down Up
## Down 7 8
## Up 36 53
mean(glm.weekly.pred2==Direction.2009.greater)
## [1] 0.5769231
glm.weekly.fit3= glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5,family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit3)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5, family = binomial,
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8534 -1.2491 0.9941 1.0886 1.5126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.220141 0.065069 3.383 0.000716 ***
## Lag1 -0.055438 0.029051 -1.908 0.056357 .
## Lag2 0.053290 0.029348 1.816 0.069401 .
## Lag3 -0.009292 0.029265 -0.318 0.750859
## Lag4 -0.024484 0.028970 -0.845 0.398036
## Lag5 -0.031544 0.029005 -1.088 0.276796
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1345.1 on 979 degrees of freedom
## AIC: 1357.1
##
## Number of Fisher Scoring iterations: 4
glm.weekly.probs3= predict(glm.weekly.fit3, Weekly.2009.greater, type="response")
glm.weekly.pred3=rep("Down",104)
glm.weekly.pred3[glm.weekly.probs3>0.5]="Up"
table(glm.weekly.pred3,Direction.2009.greater)
## Direction.2009.greater
## glm.weekly.pred3 Down Up
## Down 10 14
## Up 33 47
mean(glm.weekly.pred3==Direction.2009.greater)
## [1] 0.5480769
glm.weekly.fit4= glm(Direction~Lag1,family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit4)
##
## Call:
## glm(formula = Direction ~ Lag1, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.519 -1.253 1.028 1.094 1.281
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.21829 0.06438 3.391 0.000697 ***
## Lag1 -0.05908 0.02892 -2.043 0.041059 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.4 on 983 degrees of freedom
## AIC: 1354.4
##
## Number of Fisher Scoring iterations: 4
glm.weekly.probs4= predict(glm.weekly.fit4, Weekly.2009.greater, type="response")
glm.weekly.pred4=rep("Down",104)
glm.weekly.pred4[glm.weekly.probs4>0.5]="Up"
table(glm.weekly.pred4,Direction.2009.greater)
## Direction.2009.greater
## glm.weekly.pred4 Down Up
## Down 4 6
## Up 39 55
mean(glm.weekly.pred4==Direction.2009.greater)
## [1] 0.5673077
glm.weekly.fit5= glm(Direction~Lag1+I(Lag1^2),family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit5)
##
## Call:
## glm(formula = Direction ~ Lag1 + I(Lag1^2), family = binomial,
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.529 -1.253 1.029 1.094 1.273
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2166882 0.0687066 3.154 0.00161 **
## Lag1 -0.0588584 0.0291346 -2.020 0.04336 *
## I(Lag1^2) 0.0003167 0.0047503 0.067 0.94684
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.4 on 982 degrees of freedom
## AIC: 1356.4
##
## Number of Fisher Scoring iterations: 4
glm.weekly.probs5= predict(glm.weekly.fit5, Weekly.2009.greater, type="response")
glm.weekly.pred5=rep("Down",104)
glm.weekly.pred5[glm.weekly.probs5>0.5]="Up"
table(glm.weekly.pred5,Direction.2009.greater)
## Direction.2009.greater
## glm.weekly.pred5 Down Up
## Down 4 5
## Up 39 56
mean(glm.weekly.pred5==Direction.2009.greater)
## [1] 0.5769231
glm.weekly.fit6= glm(Direction~Lag2+I(Lag2^2),family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit6)
##
## Call:
## glm(formula = Direction ~ Lag2 + I(Lag2^2), family = binomial,
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.791 -1.253 1.005 1.100 1.196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.179006 0.068357 2.619 0.00883 **
## Lag2 0.064920 0.029734 2.183 0.02901 *
## I(Lag2^2) 0.004713 0.004569 1.031 0.30236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1349.4 on 982 degrees of freedom
## AIC: 1355.4
##
## Number of Fisher Scoring iterations: 4
glm.weekly.probs6= predict(glm.weekly.fit6, Weekly.2009.greater, type="response")
glm.weekly.pred6=rep("Down",104)
glm.weekly.pred6[glm.weekly.probs6>0.5]="Up"
table(glm.weekly.pred6,Direction.2009.greater)
## Direction.2009.greater
## glm.weekly.pred6 Down Up
## Down 8 4
## Up 35 57
mean(glm.weekly.pred6==Direction.2009.greater)
## [1] 0.625
glm.weekly.fit7= glm(Direction~Lag1+I(Lag2^2),family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit7)
##
## Call:
## glm(formula = Direction ~ Lag1 + I(Lag2^2), family = binomial,
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.602 -1.251 1.025 1.094 1.275
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.202658 0.068161 2.973 0.00295 **
## Lag1 -0.060183 0.029104 -2.068 0.03865 *
## I(Lag2^2) 0.003120 0.004518 0.691 0.48982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1349.9 on 982 degrees of freedom
## AIC: 1355.9
##
## Number of Fisher Scoring iterations: 4
glm.weekly.probs7= predict(glm.weekly.fit7, Weekly.2009.greater, type="response")
glm.weekly.pred7=rep("Down",104)
glm.weekly.pred7[glm.weekly.probs7>0.5]="Up"
table(glm.weekly.pred7,Direction.2009.greater)
## Direction.2009.greater
## glm.weekly.pred7 Down Up
## Down 6 6
## Up 37 55
mean(glm.weekly.pred7==Direction.2009.greater)
## [1] 0.5865385
glm.weekly.fit8= glm(Direction~Lag2+I(Lag1^2),family = binomial,data = Weekly,subset = train)
summary(glm.weekly.fit8)
##
## Call:
## glm(formula = Direction ~ Lag2 + I(Lag1^2), family = binomial,
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.587 -1.259 1.015 1.093 1.338
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.181618 0.068622 2.647 0.00813 **
## Lag2 0.065085 0.029785 2.185 0.02888 *
## I(Lag1^2) 0.004079 0.004583 0.890 0.37339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1349.7 on 982 degrees of freedom
## AIC: 1355.7
##
## Number of Fisher Scoring iterations: 4
glm.weekly.probs8= predict(glm.weekly.fit8, Weekly.2009.greater, type="response")
glm.weekly.pred8=rep("Down",104)
glm.weekly.pred8[glm.weekly.probs8>0.5]="Up"
table(glm.weekly.pred8,Direction.2009.greater)
## Direction.2009.greater
## glm.weekly.pred8 Down Up
## Down 8 2
## Up 35 59
mean(glm.weekly.pred8==Direction.2009.greater)
## [1] 0.6442308
Out of the above all models, Model with Lag2 and Lag1^2 shows the better performance rate of predicting the correct stock market direction.Out of 104 day 64 days are predicted correctly by this model
detach(Weekly)
In this problem you will develop a model to predict within whether a given car gets high or low gas mileage based on the Auto Data set
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
attach(Auto)
## The following object is masked from package:ggplot2:
##
## mpg
dim(Auto)
## [1] 392 9
Create a binary variable mpg01 that contains a 1 if mpg contains a value above its median and a 0 if mpg contains a value below its median.You can compute the median using median() function.
mpg01 = rep(0, 392)
mpg01[mpg>median(mpg)]=1
df.auto = data.frame(Auto,mpg01)
View(df.auto)
Explore the data graphically in order to investigate the association between mpg01 and other features. which of the other features are most likely to be useful in predicting mpg01? scatter plots and box plots may be useful tools to answer this question.Describe your findings
pairs(df.auto)
names(df.auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name" "mpg01"
cor(df.auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
As per the scatter plots and correlation coefficient mpg01 is highly correlated in positive direction with mpg and negative correlated with weight, displacement horsepower and cylinders.Also Mpg is used to derive the value of mpg01 . so I am using the other variables such as weight, displacement, cylinders,and horsepower to predict the best fit model
Split the data into training set and test set
set.seed(123)
trainIndex = createDataPartition(df.auto$mpg01,p=0.8,list=FALSE)
df.auto.train = df.auto[trainIndex,]
df.auto.test = df.auto[-trainIndex,]
mpg01.train=df.auto[trainIndex,10]
mpg01.test= factor(df.auto[-trainIndex,10])
Perform LDA on the training data in order to predict mpg01 using the variables that seemed to be more associated with mpg01 in b. What is the test error of the model obtained
detach(Auto)
attach(df.auto)
## The following object is masked _by_ .GlobalEnv:
##
## mpg01
## The following object is masked from package:ggplot2:
##
## mpg
lda.auto.fit = lda(mpg01~weight+displacement+cylinders+horsepower,data = df.auto.train)
lda.auto.fit
## Call:
## lda(mpg01 ~ weight + displacement + cylinders + horsepower, data = df.auto.train)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## weight displacement cylinders horsepower
## 0 3606.701 275.4968 6.808917 130.59873
## 1 2345.070 116.4108 4.197452 79.09554
##
## Coefficients of linear discriminants:
## LD1
## weight -0.0008475776
## displacement -0.0025319858
## cylinders -0.5010010916
## horsepower 0.0049331691
lda.auto.pred= predict(lda.auto.fit,df.auto.test)
names(lda.auto.pred)
## [1] "class" "posterior" "x"
table(lda.auto.pred$class,mpg01.test)
## mpg01.test
## 0 1
## 0 31 1
## 1 8 38
mean(lda.auto.pred$class==mpg01.test)
## [1] 0.8846154
mean(lda.auto.pred$class!=mpg01.test)
## [1] 0.1153846
Test Error Rate for Linear Discriminant Analysis method is 11.5%
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in b.what is the test error of the model obtained
qda.auto.fit = qda(mpg01~weight+displacement+cylinders+horsepower,data = df.auto.train)
qda.auto.fit
## Call:
## qda(mpg01 ~ weight + displacement + cylinders + horsepower, data = df.auto.train)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## weight displacement cylinders horsepower
## 0 3606.701 275.4968 6.808917 130.59873
## 1 2345.070 116.4108 4.197452 79.09554
qda.auto.pred= predict(qda.auto.fit,df.auto.test)
table(qda.auto.pred$class,mpg01.test)
## mpg01.test
## 0 1
## 0 32 3
## 1 7 36
mean(qda.auto.pred$class==mpg01.test)
## [1] 0.8717949
mean(qda.auto.pred$class!=mpg01.test)
## [1] 0.1282051
Test Error Rate for Quadratic Discriminant Analysis method id 12.8%
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in b.what is the test error of the model obtained
glm.auto.fit = glm(mpg01~weight+displacement+cylinders+horsepower,family = binomial,data = df.auto.train)
summary(glm.auto.fit)
##
## Call:
## glm(formula = mpg01 ~ weight + displacement + cylinders + horsepower,
## family = binomial, data = df.auto.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2318 -0.1773 0.0487 0.3398 3.4044
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.4010959 1.8995270 6.002 1.95e-09 ***
## weight -0.0016414 0.0008099 -2.027 0.04271 *
## displacement -0.0158997 0.0090122 -1.764 0.07769 .
## cylinders 0.0011353 0.3834886 0.003 0.99764
## horsepower -0.0418884 0.0150907 -2.776 0.00551 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 435.30 on 313 degrees of freedom
## Residual deviance: 163.19 on 309 degrees of freedom
## AIC: 173.19
##
## Number of Fisher Scoring iterations: 7
glm.auto.probs = predict(glm.auto.fit,df.auto.test,type = "response")
glm.auto.pred= rep(0,78)
glm.auto.pred[glm.auto.probs>0.5]=1
table(glm.auto.pred,mpg01.test)
## mpg01.test
## glm.auto.pred 0 1
## 0 31 2
## 1 8 37
mean(glm.auto.pred==mpg01.test)
## [1] 0.8717949
mean(glm.auto.pred!=mpg01.test)
## [1] 0.1282051
Test Error Rate is 12.8% for logistic regression method
perform KNN on the training data, with several values of K in order to predict mpg01.use only the variables that seemed most associated with mpg01 in (b). what test errors do you obtain?which value of K seems to perform the best fit on this data
train2_matrix = data.matrix(df.auto.train[,c("cylinders","displacement","weight","horsepower")])
test2_matrix = data.matrix(df.auto.test[,c("cylinders","displacement","weight","horsepower")])
train2_y = data.matrix(df.auto.train$mpg01)
set.seed(1)
knn.auto.pred=knn(train2_matrix,test2_matrix,train2_y,k=1)
table(knn.auto.pred,mpg01.test)
## mpg01.test
## knn.auto.pred 0 1
## 0 30 2
## 1 9 37
mean(knn.auto.pred==mpg01.test)
## [1] 0.8589744
mean(knn.auto.pred!=mpg01.test)
## [1] 0.1410256
Test error When K=1 in KNN method is 14.1%
set.seed(1)
knn.auto.pred1=knn(train2_matrix,test2_matrix,train2_y,k=3)
table(knn.auto.pred1,mpg01.test)
## mpg01.test
## knn.auto.pred1 0 1
## 0 30 0
## 1 9 39
mean(knn.auto.pred1==mpg01.test)
## [1] 0.8846154
mean(knn.auto.pred1!=mpg01.test)
## [1] 0.1153846
set.seed(1)
knn.auto.pred2=knn(train2_matrix,test2_matrix,train2_y,k=5)
table(knn.auto.pred2,mpg01.test)
## mpg01.test
## knn.auto.pred2 0 1
## 0 31 3
## 1 8 36
mean(knn.auto.pred2==mpg01.test)
## [1] 0.8589744
mean(knn.auto.pred2!=mpg01.test)
## [1] 0.1410256
set.seed(1)
knn.auto.pred3=knn(train2_matrix,test2_matrix,train2_y,k=7)
table(knn.auto.pred3,mpg01.test)
## mpg01.test
## knn.auto.pred3 0 1
## 0 31 1
## 1 8 38
mean(knn.auto.pred3==mpg01.test)
## [1] 0.8846154
mean(knn.auto.pred3!=mpg01.test)
## [1] 0.1153846
As per the above K values,K=3 and K-7 provides the small test error rate and high performance rate for the KNN method for the model with the variables weight, cylinders, displacement and horsepower in order to predict mpg01
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression , LDA, KNN models using various subset of parameters. Describe your findings
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
dim(Boston)
## [1] 506 14
attach(Boston)
crime01 = rep(0, 506)
crime01[crim>median(crim)]=1
df.boston = data.frame(Boston,crime01)
str(df.boston)
## 'data.frame': 506 obs. of 15 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ crime01: num 0 0 0 0 0 0 0 0 0 0 ...
pairs(df.boston)
cor(df.boston$crime01,df.boston)
## crim zn indus chas nox rm age
## [1,] 0.4093955 -0.436151 0.6032602 0.07009677 0.7232348 -0.1563718 0.6139399
## dis rad tax ptratio black lstat medv
## [1,] -0.6163416 0.6197862 0.6087413 0.2535684 -0.3512109 0.4532627 -0.2630167
## crime01
## [1,] 1
set.seed(123)
trainIndex = createDataPartition(df.boston$crime01,p=0.8,list=FALSE)
df.boston.train = df.boston[trainIndex,]
df.boston.test = df.boston[-trainIndex,]
crime01.test= factor(df.boston[-trainIndex,15])
detach(Boston)
attach(df.boston)
## The following object is masked _by_ .GlobalEnv:
##
## crime01
glm.boston.fit = glm(crime01~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,family = binomial,data = df.boston.train)
summary(glm.boston.fit)
##
## Call:
## glm(formula = crime01 ~ zn + indus + chas + nox + rm + age +
## dis + rad + tax + ptratio + black + lstat + medv, family = binomial,
## data = df.boston.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4149 -0.2022 -0.0004 0.0030 3.5210
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -35.258213 7.435012 -4.742 2.11e-06 ***
## zn -0.079313 0.038839 -2.042 0.041141 *
## indus -0.040156 0.049127 -0.817 0.413700
## chas 0.841962 0.845047 0.996 0.319080
## nox 48.999078 8.538917 5.738 9.56e-09 ***
## rm -0.290772 0.780705 -0.372 0.709559
## age 0.032608 0.013783 2.366 0.017990 *
## dis 0.796546 0.246093 3.237 0.001209 **
## rad 0.632139 0.171349 3.689 0.000225 ***
## tax -0.007036 0.002989 -2.354 0.018568 *
## ptratio 0.371711 0.136979 2.714 0.006655 **
## black -0.013372 0.006539 -2.045 0.040858 *
## lstat -0.020232 0.055991 -0.361 0.717845
## medv 0.152260 0.075747 2.010 0.044420 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 562.84 on 405 degrees of freedom
## Residual deviance: 173.89 on 392 degrees of freedom
## AIC: 201.89
##
## Number of Fisher Scoring iterations: 9
As per the logistic regression coefficients and their values the predictors such as “zn”,“nox”,“age”,“dis”,“rad”,“tax”, “pratio”,“black”,“medv” plays a significant role in predicting the crime rate in a specific town in boston
glm.boston.probs = predict(glm.boston.fit,df.boston.test,type = "response")
glm.boston.pred= rep(0,100)
glm.boston.pred[glm.boston.probs>0.5]=1
table(glm.boston.pred,crime01.test)
## crime01.test
## glm.boston.pred 0 1
## 0 46 4
## 1 4 46
mean(glm.boston.pred==crime01.test)
## [1] 0.92
mean(glm.boston.pred!=crime01.test)
## [1] 0.08
Test Error Rate for logistic regression method of classification is 8% for predicting the crime rate in Boston using all the predictors in Boston data set
lda.boston.fit = lda(crime01~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,data = df.boston.train)
lda.boston.fit
## Call:
## lda(crime01 ~ zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat + medv, data = df.boston.train)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## zn indus chas nox rm age dis rad
## 0 20.756158 7.135172 0.05418719 0.4733837 6.381084 51.76207 4.964841 4.20197
## 1 1.103448 15.360345 0.07881773 0.6390197 6.129340 85.69704 2.518060 14.91626
## tax ptratio black lstat medv
## 0 309.4335 17.97389 388.3729 9.521133 24.79409
## 1 511.4877 19.06552 320.4518 16.007931 19.84926
##
## Coefficients of linear discriminants:
## LD1
## zn -0.006348178
## indus 0.022904003
## chas -0.098102870
## nox 8.203944699
## rm 0.116486695
## age 0.012924549
## dis 0.090133261
## rad 0.082664716
## tax -0.001463969
## ptratio 0.051083556
## black -0.001219957
## lstat 0.004246550
## medv 0.033644473
lda.boston.pred= predict(lda.boston.fit,df.boston.test)
names(lda.boston.pred)
## [1] "class" "posterior" "x"
table(lda.boston.pred$class,crime01.test)
## crime01.test
## 0 1
## 0 47 12
## 1 3 38
mean(lda.boston.pred$class==crime01.test)
## [1] 0.85
mean(lda.boston.pred$class!=crime01.test)
## [1] 0.15
Test Error Rate for LDA method of classification is 15%
qda.boston.fit = qda(crime01~zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,data = df.boston.train)
qda.boston.fit
## Call:
## qda(crime01 ~ zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat + medv, data = df.boston.train)
##
## Prior probabilities of groups:
## 0 1
## 0.5 0.5
##
## Group means:
## zn indus chas nox rm age dis rad
## 0 20.756158 7.135172 0.05418719 0.4733837 6.381084 51.76207 4.964841 4.20197
## 1 1.103448 15.360345 0.07881773 0.6390197 6.129340 85.69704 2.518060 14.91626
## tax ptratio black lstat medv
## 0 309.4335 17.97389 388.3729 9.521133 24.79409
## 1 511.4877 19.06552 320.4518 16.007931 19.84926
qda.boston.pred= predict(qda.boston.fit,df.boston.test)
names(qda.boston.pred)
## [1] "class" "posterior"
table(qda.boston.pred$class,crime01.test)
## crime01.test
## 0 1
## 0 50 7
## 1 0 43
mean(qda.boston.pred$class==crime01.test)
## [1] 0.93
mean(qda.boston.pred$class!=crime01.test)
## [1] 0.07
Test Error for QDA method of classification is 7%
train2_boston_matrix = data.matrix(df.boston.train[,c("zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","black","lstat","medv")])
test2_boston_matrix = data.matrix(df.boston.test[,c("zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","black","lstat","medv")])
train2_boston_y = data.matrix(df.boston.train$crime01)
set.seed(1)
knn.boston.pred=knn(train2_boston_matrix,test2_boston_matrix,train2_boston_y,k=1)
table(knn.boston.pred,crime01.test)
## crime01.test
## knn.boston.pred 0 1
## 0 49 6
## 1 1 44
mean(knn.boston.pred==crime01.test)
## [1] 0.93
mean(knn.boston.pred!=crime01.test)
## [1] 0.07
Test error When K=1 in KNN method is 7%
set.seed(1)
knn.boston.pred1=knn(train2_boston_matrix,test2_boston_matrix,train2_boston_y,k=3)
table(knn.boston.pred1,crime01.test)
## crime01.test
## knn.boston.pred1 0 1
## 0 46 5
## 1 4 45
mean(knn.boston.pred1==crime01.test)
## [1] 0.91
mean(knn.boston.pred1!=crime01.test)
## [1] 0.09
Test error When K=3 in KNN method is 9%
set.seed(1)
knn.boston.pred2=knn(train2_boston_matrix,test2_boston_matrix,train2_boston_y,k=7)
table(knn.boston.pred2,crime01.test)
## crime01.test
## knn.boston.pred2 0 1
## 0 48 5
## 1 2 45
mean(knn.boston.pred2==crime01.test)
## [1] 0.93
mean(knn.boston.pred2!=crime01.test)
## [1] 0.07
Test error When K=7 in KNN method is 7%
As per the above analysis QDA method classifies the Boston town as crime town or not relatively correct close to 93% based on all the predictors available in the Boston data set