The crossv_kfold function from modelr, in combination with broom’s augment, makes it easy to fit a model on 95% of data and predict on the other 5%.

For example, the SMarket dataset shows windows of 6 days in the stock market. One may explore whether one could predict “Today” based on previous days:

library(ISLR)
head(Smarket)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up

We can perform crossfold separation, map a modeling step to the training data, and then fit on the new data with augment:

library(dplyr)
library(purrr)
library(modelr)
library(broom)
library(tidyr)
library(ISLR)

set.seed(1)

models <- Smarket %>%
  select(Today, Lag1:Lag5) %>%
  crossv_kfold(k = 20) %>%
  mutate(model = map(train, ~ lm(Today ~ ., data = .)))

predictions <- models %>%
  unnest(map2(model, test, ~ augment(.x, newdata = .y)))

predictions
## # A tibble: 1,250 × 9
##      .id  Today   Lag1   Lag2   Lag3   Lag4   Lag5      .fitted    .se.fit
##    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>        <dbl>      <dbl>
## 1     01 -1.334 -0.623 -0.841 -0.151  0.359 -1.747  0.111953989 0.06955107
## 2     01  1.183 -1.334 -0.623 -0.841 -0.151  0.359  0.056723896 0.06116814
## 3     01 -1.792 -2.408  1.763 -1.962  0.587 -2.584  0.161202096 0.13361127
## 4     01 -1.498 -0.854  1.254  3.889  1.028 -0.323  0.030862845 0.13064452
## 5     01  1.501  0.470  1.594 -1.216 -1.498 -0.854  0.015998235 0.08497825
## 6     01 -1.182  0.320 -1.553 -0.263  1.615  0.269 -0.008338452 0.07442202
## 7     01 -0.779 -1.182  0.320 -1.553 -0.263  1.615 -0.014949658 0.08058309
## 8     01 -0.184  1.008 -0.148  1.249 -0.468 -0.151 -0.010059610 0.05908068
## 9     01 -1.733  0.327 -1.142 -0.524  0.396  0.388 -0.004892958 0.05297608
## 10    01 -1.666  0.309 -0.734 -0.383  0.095  0.569 -0.013909327 0.04522048
## # ... with 1,240 more rows

From here we could calculate the cross validated MSE of our model, and compare it to simply predicting the mean:

predictions %>%
  summarize(MSE = mean((Today - .fitted) ^ 2),
            MSEIntercept = mean((Today - mean(Today))^2))
## # A tibble: 1 × 2
##        MSE MSEIntercept
##      <dbl>        <dbl>
## 1 1.312981     1.290222

We can see that it’s not so easy to predict the stock market: we’d do better just predicting the mean change and ignoring the previous days.

We could also check for trends in residuals relative to each variable:

library(ggplot2)

predictions %>%
  mutate(residual = .fitted - Today) %>%
  gather(Days, Lag, Lag1:Lag5) %>%
  ggplot(aes(Lag, residual)) +
  geom_point() +
  facet_wrap(~ Days)

No noticeable trends exist.

Importantly, this can work on models much more complicated than simple linear models (though it’s easier on ones where broom tidiers exist).