The crossv_kfold function from modelr, combined with broom's augment, makes it easy to fit a model on 95% of the data and predict on the held-out 5%, repeating that for each of the k folds.
For example, the Smarket dataset from ISLR contains daily S&P 500 returns from 2001-2005; each row holds that day's return ("Today") along with the returns from the five previous trading days (Lag1 through Lag5). We can explore whether "Today" can be predicted from those previous days:
library(ISLR)
head(Smarket)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up
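Before fitting anything, it helps to see what crossv_kfold produces on its own: a tibble with one row per fold, holding list-columns of resample objects for the training and test sets plus a fold id. The lines below are just an illustrative sketch (the folds object name is ours):

library(modelr)

# One row per fold; "train" and "test" are list-columns of resample objects
# (lightweight pointers into Smarket), and ".id" labels the fold
folds <- crossv_kfold(Smarket, k = 20)

folds$train[[1]]                        # a resample object, not a copy of the data
nrow(as.data.frame(folds$train[[1]]))   # materialize one training set (~95% of rows)
nrow(as.data.frame(folds$test[[1]]))    # the matching held-out test set (~5% of rows)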
We can create the k-fold splits, map a modeling step across each training set, and then generate predictions on the held-out test sets with augment:
library(dplyr)
library(purrr)
library(modelr)
library(broom)
library(tidyr)
library(ISLR)

set.seed(1)

# Split into 20 folds, then fit a linear model on each training set
models <- Smarket %>%
  select(Today, Lag1:Lag5) %>%
  crossv_kfold(k = 20) %>%
  mutate(model = map(train, ~ lm(Today ~ ., data = .)))

# Predict each held-out test set with the model fit on the other folds
predictions <- models %>%
  unnest(map2(model, test, ~ augment(.x, newdata = .y)))

predictions
## # A tibble: 1,250 × 9
##      .id  Today   Lag1   Lag2   Lag3   Lag4   Lag5      .fitted    .se.fit
##    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>        <dbl>      <dbl>
## 1     01 -1.334 -0.623 -0.841 -0.151  0.359 -1.747  0.111953989 0.06955107
## 2     01  1.183 -1.334 -0.623 -0.841 -0.151  0.359  0.056723896 0.06116814
## 3     01 -1.792 -2.408  1.763 -1.962  0.587 -2.584  0.161202096 0.13361127
## 4     01 -1.498 -0.854  1.254  3.889  1.028 -0.323  0.030862845 0.13064452
## 5     01  1.501  0.470  1.594 -1.216 -1.498 -0.854  0.015998235 0.08497825
## 6     01 -1.182  0.320 -1.553 -0.263  1.615  0.269 -0.008338452 0.07442202
## 7     01 -0.779 -1.182  0.320 -1.553 -0.263  1.615 -0.014949658 0.08058309
## 8     01 -0.184  1.008 -0.148  1.249 -0.468 -0.151 -0.010059610 0.05908068
## 9     01 -1.733  0.327 -1.142 -0.524  0.396  0.388 -0.004892958 0.05297608
## 10    01 -1.666  0.309 -0.734 -0.383  0.095  0.569 -0.013909327 0.04522048
## # ... with 1,240 more rows
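The fold id also makes it easy to confirm that every row of Smarket appears exactly once as a held-out observation. This is a quick sanity check, not part of the original pipeline:

# Every row is predicted once, by a model that never saw it during fitting
nrow(predictions) == nrow(Smarket)

# Roughly 62-63 held-out rows per fold (1,250 rows split into 20 folds)
predictions %>%
  count(.id)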
From here we could calculate the cross-validated MSE of our model and compare it to simply predicting the mean:
predictions %>%
  summarize(MSE = mean((Today - .fitted) ^ 2),
            MSEIntercept = mean((Today - mean(Today)) ^ 2))
## # A tibble: 1 × 2
##        MSE MSEIntercept
##      <dbl>        <dbl>
## 1 1.312981     1.290222
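Because the fold id travels along with the predictions, we can also look at how the out-of-sample MSE varies across the 20 folds. This is an extra sketch on top of the pipeline above (the fold_mse name is ours), not part of the original comparison:

# Out-of-sample MSE within each held-out fold
fold_mse <- predictions %>%
  group_by(.id) %>%
  summarize(MSE = mean((Today - .fitted) ^ 2))

summary(fold_mse$MSE)   # spread of the per-fold estimates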
We can see that it's not so easy to predict the stock market: the cross-validated MSE (1.31) is actually higher than the MSE from simply predicting the mean change (1.29), so we'd do better ignoring the previous days altogether.
We could also check for trends in residuals relative to each variable:
library(ggplot2)

predictions %>%
  mutate(residual = .fitted - Today) %>%
  gather(Days, Lag, Lag1:Lag5) %>%
  ggplot(aes(Lag, residual)) +
  geom_point() +
  facet_wrap(~ Days)
No noticeable trends appear in the resulting plot.
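As a rough numeric complement to the plot (an extra check rather than part of the original analysis), we could compute the correlation between the out-of-sample residuals and each lag; correlations near zero would be consistent with the visual impression:

# Correlation of out-of-sample residuals with each lag variable
predictions %>%
  mutate(residual = .fitted - Today) %>%
  gather(Days, Lag, Lag1:Lag5) %>%
  group_by(Days) %>%
  summarize(correlation = cor(Lag, residual))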
Importantly, this approach works for models much more complicated than a simple linear regression (though it's easiest for model classes that have broom tidier methods).
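As a hedged sketch of what that might look like, here is the same fold structure applied to a different model class (randomForest, chosen purely for illustration; object names like models_rf are ours). Because the predictions are built with predict() inside map2, the pattern does not rely on an augment method existing for the model:

# Sketch only: randomForest stands in for any more complicated model class
library(randomForest)

set.seed(1)

models_rf <- Smarket %>%
  select(Today, Lag1:Lag5) %>%
  crossv_kfold(k = 20) %>%
  mutate(model = map(train, ~ randomForest(Today ~ ., data = as.data.frame(.))))

predictions_rf <- models_rf %>%
  mutate(pred = map2(model, test, ~ {
    test_df <- as.data.frame(.y)               # materialize the held-out fold
    test_df$.fitted <- predict(.x, newdata = test_df)
    test_df
  })) %>%
  unnest(pred)

# Cross-validated MSE for the alternative model
predictions_rf %>%
  summarize(MSE = mean((Today - .fitted) ^ 2))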