This document replicates Sections 4.1 and 4.2 of the
ISLR classification lab (https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/04-classification.html)
using the Smarket stock-market data. The lab webpage uses
the tidymodels framework; here we reproduce exactly the same
analysis and results with base R and
glm(), which is what tidymodels uses underneath.
The numbers below match the lab.
Only one small package is needed (install once):
install.packages("ISLR")
library(ISLR) # provides the Smarket data set
The Smarket data set records daily percentage returns
for a stock-market index: the previous five days’ returns
(Lag1–Lag5), trading Volume, the
Year, today’s return (Today), and the
qualitative response Direction
("Up"/"Down").
str(Smarket)
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##
##
##
##
Direction is the 9th column and is not numeric, so we
exclude it before computing the correlation matrix.
cor_mat <- cor(Smarket[, -9])
round(cor_mat, 3)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
## Year 1.000 0.030 0.031 0.033 0.036 0.030 0.539 0.030
## Lag1 0.030 1.000 -0.026 -0.011 -0.003 -0.006 0.041 -0.026
## Lag2 0.031 -0.026 1.000 -0.026 -0.011 -0.004 -0.043 -0.010
## Lag3 0.033 -0.011 -0.026 1.000 -0.024 -0.019 -0.042 -0.002
## Lag4 0.036 -0.003 -0.011 -0.024 1.000 -0.027 -0.048 -0.007
## Lag5 0.030 -0.006 -0.004 -0.019 -0.027 1.000 -0.022 -0.035
## Volume 0.539 0.041 -0.043 -0.042 -0.048 -0.022 1.000 0.015
## Today 0.030 -0.026 -0.010 -0.002 -0.007 -0.035 0.015 1.000
Almost all pairs of variables are essentially uncorrelated. The one
noticeable exception is Year and Volume.
A quick heat-map of the correlation matrix (uses the optional
corrplot package if installed; otherwise the rounded matrix
above is the summary):
if (requireNamespace("corrplot", quietly = TRUE)) {
corrplot::corrplot(cor_mat, method = "color", addCoef.col = "black",
tl.col = "black", number.cex = 0.7,
col = colorRampPalette(c("indianred2", "white", "skyblue1"))(200))
} else {
message("Install corrplot for a heat-map: install.packages('corrplot')")
round(cor_mat, 3)
}
Plotting Year against Volume confirms that
trading volume drifts upward over time.
plot(jitter(Smarket$Year), Smarket$Volume,
xlab = "Year", ylab = "Volume",
main = "Volume tends to increase over time",
pch = 20, col = "grey40")
We fit a logistic regression of Direction on the five
lagged returns plus Volume, using glm() with
family = binomial.
glm_fit <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket, family = binomial
)
summary(glm_fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
Notice that none of the predictors has a small
p-value — there is no strong evidence that any single lag or
volume helps predict the market direction. (R models the probability of
"Up", the second factor level.)
predict(..., type = "response") returns the predicted
probability of "Up". We classify as "Up" when
that probability exceeds 0.5.
glm_probs <- predict(glm_fit, type = "response")
glm_pred <- rep("Down", nrow(Smarket))
glm_pred[glm_probs > 0.5] <- "Up"
# confusion matrix
table(Predicted = glm_pred, Truth = Smarket$Direction)
## Truth
## Predicted Down Up
## Down 145 141
## Up 457 507
# accuracy
mean(glm_pred == Smarket$Direction)
## [1] 0.5216
The accuracy is only about 0.52 — barely better than guessing — and
the model tends to predict "Up" too often. Evaluating on
the same data used for fitting is also over-optimistic.
A more honest evaluation trains on the years before 2005 and tests on 2005.
train <- Smarket$Year < 2005
Smarket_2005 <- Smarket[!train, ]
Direction_2005 <- Smarket$Direction[!train]
glm_fit2 <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket, family = binomial, subset = train
)
glm_probs2 <- predict(glm_fit2, Smarket_2005, type = "response")
glm_pred2 <- rep("Down", nrow(Smarket_2005))
glm_pred2[glm_probs2 > 0.5] <- "Up"
table(Predicted = glm_pred2, Truth = Direction_2005)
## Truth
## Predicted Down Up
## Down 77 97
## Up 34 44
mean(glm_pred2 == Direction_2005) # test accuracy
## [1] 0.4801587
On genuinely new (2005) data the accuracy drops to about 0.48 — below 50% — confirming how weak this model is out of sample.
Lag1 + Lag2 onlyBecause the extra predictors looked useless, dropping them may reduce variance without adding bias.
glm_fit3 <- glm(Direction ~ Lag1 + Lag2,
data = Smarket, family = binomial, subset = train)
glm_probs3 <- predict(glm_fit3, Smarket_2005, type = "response")
glm_pred3 <- rep("Down", nrow(Smarket_2005))
glm_pred3[glm_probs3 > 0.5] <- "Up"
table(Predicted = glm_pred3, Truth = Direction_2005)
## Truth
## Predicted Down Up
## Down 35 35
## Up 76 106
mean(glm_pred3 == Direction_2005) # test accuracy
## [1] 0.5595238
Accuracy improves to about 0.56 — the simpler model generalises better.
Finally we predict the probability of "Up" for two
hypothetical days: Lag1 = 1.2, Lag2 = 1.1, and
Lag1 = 1.5, Lag2 = -0.8.
newdata <- data.frame(Lag1 = c(1.2, 1.5),
Lag2 = c(1.1, -0.8))
predict(glm_fit3, newdata = newdata, type = "response") # P(Up)
## 1 2
## 0.4791462 0.4960939
Smarket data: the
predictors are almost all uncorrelated, with only a mild
Year–Volume relationship.Lag1 + Lag2 raised test accuracy to roughly 56%, showing
that removing uninformative predictors can help out-of-sample
performance.