This document replicates Sections 4.1 and 4.2 of the ISLR classification lab (https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/04-classification.html) using the Smarket stock-market data. The lab webpage uses the tidymodels framework; here we reproduce exactly the same analysis and results with base R and glm(), which is what tidymodels uses underneath. The numbers below match the lab.

Only one small package is needed (install once): install.packages("ISLR")

library(ISLR)   # provides the Smarket data set

1 4.1 The Stock Market Data

The Smarket data set records daily percentage returns for a stock-market index: the previous five days’ returns (Lag1Lag5), trading Volume, the Year, today’s return (Today), and the qualitative response Direction ("Up"/"Down").

str(Smarket)
## 'data.frame':    1250 obs. of  9 variables:
##  $ Year     : num  2001 2001 2001 2001 2001 ...
##  $ Lag1     : num  0.381 0.959 1.032 -0.623 0.614 ...
##  $ Lag2     : num  -0.192 0.381 0.959 1.032 -0.623 ...
##  $ Lag3     : num  -2.624 -0.192 0.381 0.959 1.032 ...
##  $ Lag4     : num  -1.055 -2.624 -0.192 0.381 0.959 ...
##  $ Lag5     : num  5.01 -1.055 -2.624 -0.192 0.381 ...
##  $ Volume   : num  1.19 1.3 1.41 1.28 1.21 ...
##  $ Today    : num  0.959 1.032 -0.623 0.614 0.213 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
summary(Smarket)
##       Year           Lag1                Lag2                Lag3          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
##  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
##       Lag4                Lag5              Volume           Today          
##  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
##  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
##  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
##  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
##  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
##  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
##  Direction 
##  Down:602  
##  Up  :648  
##            
##            
##            
## 

1.1 Correlation between the numeric variables

Direction is the 9th column and is not numeric, so we exclude it before computing the correlation matrix.

cor_mat <- cor(Smarket[, -9])
round(cor_mat, 3)
##         Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today
## Year   1.000  0.030  0.031  0.033  0.036  0.030  0.539  0.030
## Lag1   0.030  1.000 -0.026 -0.011 -0.003 -0.006  0.041 -0.026
## Lag2   0.031 -0.026  1.000 -0.026 -0.011 -0.004 -0.043 -0.010
## Lag3   0.033 -0.011 -0.026  1.000 -0.024 -0.019 -0.042 -0.002
## Lag4   0.036 -0.003 -0.011 -0.024  1.000 -0.027 -0.048 -0.007
## Lag5   0.030 -0.006 -0.004 -0.019 -0.027  1.000 -0.022 -0.035
## Volume 0.539  0.041 -0.043 -0.042 -0.048 -0.022  1.000  0.015
## Today  0.030 -0.026 -0.010 -0.002 -0.007 -0.035  0.015  1.000

Almost all pairs of variables are essentially uncorrelated. The one noticeable exception is Year and Volume.

A quick heat-map of the correlation matrix (uses the optional corrplot package if installed; otherwise the rounded matrix above is the summary):

if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "color", addCoef.col = "black",
                     tl.col = "black", number.cex = 0.7,
                     col = colorRampPalette(c("indianred2", "white", "skyblue1"))(200))
} else {
  message("Install corrplot for a heat-map: install.packages('corrplot')")
  round(cor_mat, 3)
}

1.2 Year versus Volume

Plotting Year against Volume confirms that trading volume drifts upward over time.

plot(jitter(Smarket$Year), Smarket$Volume,
     xlab = "Year", ylab = "Volume",
     main = "Volume tends to increase over time",
     pch = 20, col = "grey40")

2 4.2 Logistic Regression

2.1 Fitting the model on the full data

We fit a logistic regression of Direction on the five lagged returns plus Volume, using glm() with family = binomial.

glm_fit <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Smarket, family = binomial
)
summary(glm_fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

Notice that none of the predictors has a small p-value — there is no strong evidence that any single lag or volume helps predict the market direction. (R models the probability of "Up", the second factor level.)

2.2 Predictions and the confusion matrix (training data)

predict(..., type = "response") returns the predicted probability of "Up". We classify as "Up" when that probability exceeds 0.5.

glm_probs <- predict(glm_fit, type = "response")
glm_pred  <- rep("Down", nrow(Smarket))
glm_pred[glm_probs > 0.5] <- "Up"

# confusion matrix
table(Predicted = glm_pred, Truth = Smarket$Direction)
##          Truth
## Predicted Down  Up
##      Down  145 141
##      Up    457 507
# accuracy
mean(glm_pred == Smarket$Direction)
## [1] 0.5216

The accuracy is only about 0.52 — barely better than guessing — and the model tends to predict "Up" too often. Evaluating on the same data used for fitting is also over-optimistic.

2.3 Train / test split by year

A more honest evaluation trains on the years before 2005 and tests on 2005.

train        <- Smarket$Year < 2005
Smarket_2005 <- Smarket[!train, ]
Direction_2005 <- Smarket$Direction[!train]
glm_fit2 <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Smarket, family = binomial, subset = train
)

glm_probs2 <- predict(glm_fit2, Smarket_2005, type = "response")
glm_pred2  <- rep("Down", nrow(Smarket_2005))
glm_pred2[glm_probs2 > 0.5] <- "Up"

table(Predicted = glm_pred2, Truth = Direction_2005)
##          Truth
## Predicted Down Up
##      Down   77 97
##      Up     34 44
mean(glm_pred2 == Direction_2005)   # test accuracy
## [1] 0.4801587

On genuinely new (2005) data the accuracy drops to about 0.48 — below 50% — confirming how weak this model is out of sample.

2.4 A smaller model: Lag1 + Lag2 only

Because the extra predictors looked useless, dropping them may reduce variance without adding bias.

glm_fit3 <- glm(Direction ~ Lag1 + Lag2,
                data = Smarket, family = binomial, subset = train)

glm_probs3 <- predict(glm_fit3, Smarket_2005, type = "response")
glm_pred3  <- rep("Down", nrow(Smarket_2005))
glm_pred3[glm_probs3 > 0.5] <- "Up"

table(Predicted = glm_pred3, Truth = Direction_2005)
##          Truth
## Predicted Down  Up
##      Down   35  35
##      Up     76 106
mean(glm_pred3 == Direction_2005)   # test accuracy
## [1] 0.5595238

Accuracy improves to about 0.56 — the simpler model generalises better.

2.5 Predicting for specific new observations

Finally we predict the probability of "Up" for two hypothetical days: Lag1 = 1.2, Lag2 = 1.1, and Lag1 = 1.5, Lag2 = -0.8.

newdata <- data.frame(Lag1 = c(1.2, 1.5),
                      Lag2 = c(1.1, -0.8))
predict(glm_fit3, newdata = newdata, type = "response")  # P(Up)
##         1         2 
## 0.4791462 0.4960939

3 Summary

  • 4.1 explored the Smarket data: the predictors are almost all uncorrelated, with only a mild YearVolume relationship.
  • 4.2 fit a logistic-regression classifier. On the training data its accuracy was about 52%, and on the held-out 2005 data only about 48% — essentially no better than chance. Reducing the model to Lag1 + Lag2 raised test accuracy to roughly 56%, showing that removing uninformative predictors can help out-of-sample performance.