The crossv_kfold function from modelr, combined with broom's augment, makes it easy to fit a model on 95% of the data and predict on the held-out 5%, repeating that for each of the k folds.
For example, the Smarket dataset from ISLR contains daily S&P 500 returns from 2001-2005; each row holds that day's return ("Today") along with the returns from the five previous trading days (Lag1 through Lag5). We can explore whether "Today" can be predicted from those previous days:
library(ISLR)
head(Smarket)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up
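Before fitting anything, it helps to see what crossv_kfold produces on its own: a tibble with one row per fold, holding list-columns of resample objects for the training and test sets plus a fold id. The lines below are just an illustrative sketch (the folds object name is ours):

library(modelr)

# One row per fold; "train" and "test" are list-columns of resample objects
# (lightweight pointers into Smarket), and ".id" labels the fold
folds <- crossv_kfold(Smarket, k = 20)

folds$train[[1]]                        # a resample object, not a copy of the data
nrow(as.data.frame(folds$train[[1]]))   # materialize one training set (~95% of rows)
nrow(as.data.frame(folds$test[[1]]))    # the matching held-out test set (~5% of rows)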
We can create the k-fold splits, map a modeling step across each training set, and then generate predictions on the held-out test sets with augment:
library(dplyr)
library(purrr)
library(modelr)
library(broom)
library(tidyr)
library(ISLR)

set.seed(1)

# Split into 20 folds, then fit a linear model on each training set
models <- Smarket %>%
  select(Today, Lag1:Lag5) %>%
  crossv_kfold(k = 20) %>%
  mutate(model = map(train, ~ lm(Today ~ ., data = .)))

# Predict each held-out test set with the model fit on the other folds
predictions <- models %>%
  unnest(map2(model, test, ~ augment(.x, newdata = .y)))

predictions
## # A tibble: 1,250 × 9
##      .id  Today   Lag1   Lag2   Lag3   Lag4   Lag5      .fitted    .se.fit
##    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>        <dbl>      <dbl>
## 1     01 -1.334 -0.623 -0.841 -0.151  0.359 -1.747  0.111953989 0.06955107
## 2     01  1.183 -1.334 -0.623 -0.841 -0.151  0.359  0.056723896 0.06116814
## 3     01 -1.792 -2.408  1.763 -1.962  0.587 -2.584  0.161202096 0.13361127
## 4     01 -1.498 -0.854  1.254  3.889  1.028 -0.323  0.030862845 0.13064452
## 5     01  1.501  0.470  1.594 -1.216 -1.498 -0.854  0.015998235 0.08497825
## 6     01 -1.182  0.320 -1.553 -0.263  1.615  0.269 -0.008338452 0.07442202
## 7     01 -0.779 -1.182  0.320 -1.553 -0.263  1.615 -0.014949658 0.08058309
## 8     01 -0.184  1.008 -0.148  1.249 -0.468 -0.151 -0.010059610 0.05908068
## 9     01 -1.733  0.327 -1.142 -0.524  0.396  0.388 -0.004892958 0.05297608
## 10    01 -1.666  0.309 -0.734 -0.383  0.095  0.569 -0.013909327 0.04522048
## # ... with 1,240 more rows
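The fold id also makes it easy to confirm that every row of Smarket appears exactly once as a held-out observation. This is a quick sanity check, not part of the original pipeline:

# Every row is predicted once, by a model that never saw it during fitting
nrow(predictions) == nrow(Smarket)

# Roughly 62-63 held-out rows per fold (1,250 rows split into 20 folds)
predictions %>%
  count(.id)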
From here we could calculate the cross-validated MSE of our model and compare it to simply predicting the mean:
predictions %>%
  summarize(MSE = mean((Today - .fitted) ^ 2),
            MSEIntercept = mean((Today - mean(Today)) ^ 2))
## # A tibble: 1 × 2
##        MSE MSEIntercept
##      <dbl>        <dbl>
## 1 1.312981     1.290222
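Because the fold id travels along with the predictions, we can also look at how the out-of-sample MSE varies across the 20 folds. This is an extra sketch on top of the pipeline above (the fold_mse name is ours), not part of the original comparison:

# Out-of-sample MSE within each held-out fold
fold_mse <- predictions %>%
  group_by(.id) %>%
  summarize(MSE = mean((Today - .fitted) ^ 2))

summary(fold_mse$MSE)   # spread of the per-fold estimates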
We can see that it's not so easy to predict the stock market: the cross-validated MSE (1.31) is actually higher than the MSE from simply predicting the mean change (1.29), so we'd do better ignoring the previous days altogether.
We could also check for trends in residuals relative to each variable:
library(ggplot2)

predictions %>%
  mutate(residual = .fitted - Today) %>%
  gather(Days, Lag, Lag1:Lag5) %>%
  ggplot(aes(Lag, residual)) +
  geom_point() +
  facet_wrap(~ Days)
No noticeable trends appear in the resulting plot.
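As a rough numeric complement to the plot (an extra check rather than part of the original analysis), we could compute the correlation between the out-of-sample residuals and each lag; correlations near zero would be consistent with the visual impression:

# Correlation of out-of-sample residuals with each lag variable
predictions %>%
  mutate(residual = .fitted - Today) %>%
  gather(Days, Lag, Lag1:Lag5) %>%
  group_by(Days) %>%
  summarize(correlation = cor(Lag, residual))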
Importantly, this approach works for models much more complicated than a simple linear regression (though it's easiest for model classes that have broom tidier methods).
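As a hedged sketch of what that might look like, here is the same fold structure applied to a different model class (randomForest, chosen purely for illustration; object names like models_rf are ours). Because the predictions are built with predict() inside map2, the pattern does not rely on an augment method existing for the model:

# Sketch only: randomForest stands in for any more complicated model class
library(randomForest)

set.seed(1)

models_rf <- Smarket %>%
  select(Today, Lag1:Lag5) %>%
  crossv_kfold(k = 20) %>%
  mutate(model = map(train, ~ randomForest(Today ~ ., data = as.data.frame(.))))

predictions_rf <- models_rf %>%
  mutate(pred = map2(model, test, ~ {
    test_df <- as.data.frame(.y)               # materialize the held-out fold
    test_df$.fitted <- predict(.x, newdata = test_df)
    test_df
  })) %>%
  unnest(pred)

# Cross-validated MSE for the alternative model
predictions_rf %>%
  summarize(MSE = mean((Today - .fitted) ^ 2))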