CHEN, PO-YU / STUDENT ID: 606890100

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with “Lag2” as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

(e) Repeat (d) using LDA.

library(ISLR)
library(MASS)
attach(Weekly)
# Training period: 1990 through 2008; held-out data: 2009 and 2010
train <- (Year <= 2008)
W.2008 <- Weekly[!train, ]
D.2008 <- Direction[!train]
# Fit LDA on the training period with Lag2 as the only predictor
lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
# Histograms of the discriminant scores for each class
plot(lda.fit)

# Predict on the held-out 2009-2010 weeks
lda.p <- predict(lda.fit, W.2008)
names(lda.p)
## [1] "class"     "posterior" "x"
lda.c <- lda.p$class
# Confusion matrix for the held-out data
table(lda.c, D.2008)
##       D.2008
## lda.c  Down Up
##   Down    9  5
##   Up     34 56
# Overall fraction of correct predictions
mean(lda.c == D.2008)
## [1] 0.625
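
As a quick arithmetic check, the accuracy can be read directly off the confusion matrix: the diagonal cells are the correct predictions.

# Accuracy from the confusion matrix above
(9 + 56) / (9 + 5 + 34 + 56)    # = 65/104 = 0.625, matching mean(lda.c == D.2008)
# The classifier is heavily tilted toward "Up":
56 / (5 + 56)                   # recall on true "Up" weeks, about 0.92
9 / (9 + 34)                    # recall on true "Down" weeks, about 0.21
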
# Weeks classified "Down" (posterior P(Down) >= .5) ...
sum(lda.p$posterior[ , 1] >= .5)
## [1] 14
# ... and weeks classified "Up"
sum(lda.p$posterior[ , 1] < .5)
## [1] 90
# First 20 posterior probabilities of "Down"
lda.p$posterior[1 : 20, 1]
##       986       987       988       989       990       991       992 
## 0.4736555 0.3558617 0.5132860 0.5142948 0.4799727 0.4597586 0.3771117 
##       993       994       995       996       997       998       999 
## 0.5184724 0.5480397 0.5146118 0.5504246 0.3055404 0.4268160 0.3637275 
##      1000      1001      1002      1003      1004      1005 
## 0.4034316 0.4256310 0.4277053 0.4548626 0.4308002 0.3674066
lda.c[1 : 20]
##  [1] Up   Up   Down Down Up   Up   Up   Down Down Down Down Up   Up   Up  
## [15] Up   Up   Up   Up   Up   Up  
## Levels: Down Up
# How many weeks are classified "Down" with high confidence?
sum(lda.p$posterior[ , 1] > .9)
## [1] 0
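
Since no posterior probability of Down exceeds 0.9, a stricter cutoff would send every held-out week to the “Up” class. As an illustrative sketch, a custom cutoff can be applied directly to the posterior column:

# Sketch: reclassify with a stricter cutoff on P(Down)
lda.c90 <- ifelse(lda.p$posterior[, "Down"] > .9, "Down", "Up")
table(lda.c90, D.2008)
# With no posterior above 0.9 this rule predicts "Up" everywhere,
# which is correct for the 61 true "Up" weeks: 61/104, about 0.587
mean(lda.c90 == D.2008)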

(h) Which of these methods appears to provide the best results on this data?

Comparing the logistic regression model with linear discriminant analysis, we find that the results on Weekly are identical: both methods achieve a prediction accuracy of 0.625 on the held-out data. The logistic regression fit from (d) is reproduced below for comparison.

# Same train/test split as in (e)
train <- (Year <= 2008)
W.2008 <- Weekly[!train, ]
D.2008 <- Direction[!train]
# Logistic regression from (d): Lag2 as the only predictor
glm.fit <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.p <- predict(glm.fit, W.2008, type = "response")
# Predict "Up" when the fitted probability exceeds 0.5 (104 held-out weeks)
glm.pred <- rep("Down", 104)
glm.pred[glm.p > .5] <- "Up"
table(glm.pred, D.2008)
##         D.2008
## glm.pred Down Up
##     Down    9  5
##     Up     34 56
mean(glm.pred == D.2008)
## [1] 0.625
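
The two confusion matrices are identical, so the agreement goes beyond the overall accuracy figure. As a direct check (a sketch, assuming lda.c from part (e) is still in the workspace), the predictions can be compared week by week:

# Fraction of held-out weeks where logistic regression and LDA agree;
# a value of 1 means the two methods make identical predictions
mean(glm.pred == lda.c)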

Moreover, to make sure this agreement is not something peculiar to the Weekly data, I repeat the comparison on the Smarket data set with both methods. There, too, the two methods produce exactly the same test accuracy, 0.4801587.

attach(Smarket)
## The following objects are masked from Weekly:
## 
##     Direction, Lag1, Lag2, Lag3, Lag4, Lag5, Today, Volume, Year
# Train on 2001-2004, hold out 2005 (252 trading days)
train <- (Year < 2005)
S.2005 <- Smarket[!train, ]
D.2005 <- Direction[!train]
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Smarket, family = binomial, subset = train)
glm.p <- predict(glm.fit, S.2005, type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.p > .5] <- "Up"
table(glm.pred, D.2005)
##         D.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
mean(glm.pred == D.2005)
## [1] 0.4801587
# LDA with the same predictors and the same split
train <- (Year < 2005)
S.2005 <- Smarket[!train, ]
D.2005 <- Direction[!train]
lda.fit <- lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Smarket, subset = train)
lda.p <- predict(lda.fit, S.2005)
lda.c <- lda.p$class
table(lda.c, D.2005)
##       D.2005
## lda.c  Down Up
##   Down   77 97
##   Up     34 44
mean(lda.c == D.2005)
## [1] 0.4801587
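
Once more the confusion matrices coincide. The same day-by-day comparison can be run on Smarket (a sketch using the objects defined just above):

# Do logistic regression and LDA classify each 2005 day the same way?
table(glm.pred, lda.c)
mean(glm.pred == lda.c)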

In conclusion, logistic regression and LDA provide identical results on the Weekly data, and the Smarket experiment shows the same pattern: both methods yield the same confusion matrix and the same prediction accuracy. This is not surprising, since both classifiers produce decision boundaries that are linear in the predictors; they differ only in how the coefficients are estimated, so in practice their predictions often agree very closely.