library(ISLR2)
The Smarket
data is part of the ISLR2
library. This data set consists of percentage returns for the S&P
500 stock index over 1, 250 days, from the beginning of 2001 until the
end of 2005. For each date, we have recorded the percentage returns for
each of the five previous trading days, Lag1
through
Lag5.
We have also recorded Volume
(the number
of shares traded on the previous day, in billions), Today
(the percentage return on the date in question) and Direction (whether
the market was Up
or Down
on this date). Our
goal is to predict Direction
(a qualitative response) using
the other features.
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
dim(Smarket)
## [1] 1250 9
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##
##
##
##
The cor()
function produces a matrix that contains all
of the pairwise correlations among the predictors in a data set.
cor(Smarket[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
# Remove the Direction variable because it is qualitative data
The correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns.
attach(Smarket)
plot(Volume)
By plotting the data, which is ordered chronologically, we see that
Volume
is increasing over time. In other words, the average
number of shares traded daily increased from 2001 to 2005.
Next, we will ft a logistic regression model in order to predict
Direction
using Lag1
through Lag5
and Volume.
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = binomial, data = Smarket)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
The smallest p-value here is associated with Lag1.
The
negative coefficient for this predictor suggests that if the market had
a positive return yesterday, then it is less likely to go up today.
However, at a value of 0.15, the p-value is still relatively large, and
so there is no clear evidence of a real association between
Lag1
and Direction
.
We use the coef()
function in order to access just the
coefficients for this fitted model. We can also use the
summary()
function to access particular aspects of the
fitted model, such as the p-values for the coefficients.
coef(glm.fits)
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
## Volume
## 0.135440659
summary(glm.fits)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1 -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2 -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3 0.011085108 0.04993854 0.2219750 0.8243333
## Lag4 0.009358938 0.04997413 0.1872757 0.8514445
## Lag5 0.010313068 0.04951146 0.2082966 0.8349974
## Volume 0.135440659 0.15835970 0.8552723 0.3924004
summary(glm.fits)$coef[, 4]
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.6006983 0.1452272 0.3983491 0.8243333 0.8514445 0.8349974
## Volume
## 0.3924004
The predict()
function can be used to predict the
probability that the market will go up, given values of the predictors.
The type = "response"
option tells R
to output
probabilities of the form P(Y = 1|X), as opposed to other information
such as the logit. If no data set is supplied to the
predict()
function, then the probabilities are computed for
the training data that was used to ft the logistic regression model.
Here we have printed only the frst ten probabilities. We know that these
values correspond to the probability of the market going up, rather than
down, because thecontrasts()
function indicates that
R
has created a dummy variable with a 1 for
Up
.
glm.probs <- predict(glm.fits, type = 'response')
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292
## 9 10
## 0.5176135 0.4888378
contrasts(Direction)
## Up
## Down 0
## Up 1
In order to make a prediction as to whether the market will go up or
down on a particular day, we must convert these predicted probabilities
into class labels, Up
or Down.
The following
two commands create a vector of class predictions based on whether the
predicted probability of a market increase is greater than or less than
0.5.
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"
The frst command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classifed.
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 145 141
## Up 457 507
(507 + 145) / 1250
## [1] 0.5216
mean(glm.pred == Direction)
## [1] 0.5216
The mean() function can be used to compute the fraction of days for
which the prediction was correct. In this case, logistic regression
correctly predicted the movement of the market 52.2 % of the time.
At first glance, it appears that the logistic regression model is
working a little better than random guessing. However, this result is
misleading because we trained and tested the model on the same set of 1,
250 observations. In other words, 100% − 52.2% = 47.8%, is the training
error rate. As we have seen previously, the training error rate is often
overly optimistic—it tends to underestimate the test error rate.
In order to better assess the accuracy of the logistic regression model
in this setting, we can fit the model using part of the data, and then
examine how well it predicts the held out data. To implement this
strategy, we will first create a vector corresponding to the
observations from 2001 through 2004. We will then use this vector to
create a held out data set of observations from 2005.
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
## [1] 252 9
Direction.2005 <- Direction[!train]
We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set—that is, for the days in 2005.
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial, subset = train)
glm.probs <- predict(glm.fits, Smarket.2005, type = "response")
Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
mean(glm.pred == Direction.2005)
## [1] 0.4801587
mean(glm.pred != Direction.2005)
## [1] 0.5198413
The != notation means not equal to, and so the last command computes
the test set error rate. The results are rather disappointing: the test
error rate is 52 %, which is worse than random guessing! Of course this
result is not all that surprising, given that one would not generally
expect to be able to use previous days’ returns to predict future market
performance.
Below we have refit the logistic regression using just Lag1 and Lag2,
which seemed to have the highest predictive power in the original
logistic regression model.
glm.fits <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial, subset = train)
glm.probs <- predict(glm.fits, Smarket.2005, type = 'response')
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 35 35
## Up 76 106
mean(glm.pred == Direction.2005)
## [1] 0.5595238
106 / (106 + 76)
## [1] 0.5824176
Now the results appear to be a little better: 56% of the daily
movements have been correctly predicted. It is worth noting that in this
case, a much simpler strategy of predicting that the market will
increase every day will also be correct 56% of the time! Hence, in terms
of overall error rate, the logistic regression method is no better than
the naive approach. However, the confusion matrix shows that on days
when logistic regression predicts an increase in the market, it has a
58% accuracy rate. This suggests a possible trading strategy of buying
on days when the model predicts an increasing market, and avoiding
trades on days when a decrease is predicted. Of course one would need to
investigate more carefully whether this small improvement was real or
just due to random chance.
Suppose that we want to predict the returns associated with particular
values of Lag1
and Lag2.
In particular, we
want to predict Direction
on a day when Lag1
and Lag2
equal 1.2 and 1.1, respectively, and on a day when
they equal 1.5 and −0.8. We do this using the predict()
function.
predict(glm.fits, newdata = data.frame(Lag1 = c(1.2, 1.5), Lag2 = c(1.1, -0.8)), type = "response")
## 1 2
## 0.4791462 0.4960939