For both the data1 and data2 tibbles (a tibble is a data frame with some extra metadata attached):
pull()

The pull() function is the dplyr way to return a column of a data frame as a vector; if the data frame has only one row, the result is a length-one vector. Note that the following two commands do the same thing. I use whichever is more convenient at the moment.
mtcars %>% pull(mpg)
mtcars$mpg
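For example, since pull() always returns a vector, pulling from a one-row slice of mtcars gives a length-one vector:

# pull() on a one-row data frame returns a vector of length 1
mtcars %>%
  slice(1) %>%
  pull(mpg)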
The simplest solution is to run \(k=2\)-fold cross-validation, as done in ChalkTalk 1.7:
# Consider a range of degrees of freedom. I chose this range through trial and error.
df_vector <- 2:60

# Save results here
results <- tibble(
  df = df_vector,
  RMSE = 0
)
for(i in 1:length(df_vector)){
  # Randomly split data1 into halves A and B. Note: these two datasets are disjoint
  data1_A <- data1 %>%
    sample_frac(0.5)
  data1_B <- data1 %>%
    anti_join(data1_A, by = "ID")

  # Fit model on A, get predictions for B, compute RMSE on B
  model <- smooth.spline(x = data1_A$x, y = data1_A$y, df = df_vector[i])
  predictions <- predict(model, data1_B$x) %>%
    as_tibble() %>%
    pull(y)
  RMSE_B <- data1_B %>%
    mutate(yhat = predictions) %>%
    summarise(RMSE = sqrt(mean((y - yhat)^2))) %>%
    pull(RMSE)

  # Fit model on B, get predictions for A, compute RMSE on A
  model <- smooth.spline(x = data1_B$x, y = data1_B$y, df = df_vector[i])
  predictions <- predict(model, data1_A$x) %>%
    as_tibble() %>%
    pull(y)
  RMSE_A <- data1_A %>%
    mutate(yhat = predictions) %>%
    summarise(RMSE = sqrt(mean((y - yhat)^2))) %>%
    pull(RMSE)

  # Take the mean of the two folds' RMSEs
  results$RMSE[i] <- (RMSE_A + RMSE_B)/2
}
What is \(df^*\), the optimal degrees of freedom that yields the lowest estimate \(\widehat{RMSE}\) of the true out-of-sample predictive \(RMSE\)?
| df | RMSE |
|---|---|
| 24 | 15.048 |
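As a quick sketch, here is one way to read \(df^*\) off programmatically from the results tibble built above:

# Find the row with the lowest estimated RMSE
results %>%
  arrange(RMSE) %>%
  slice(1)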
Let’s now generalize this idea to 5-fold cross-validation. Carefully note the folding scheme; recall the code exercise at the end of Lecture 2.4.
# Set up the folding scheme. Note how we first shuffle the rows and then assign
# roughly equal numbers of observations to each fold. This scheme ensures the
# folds are disjoint.
data1 <- data1 %>%
  sample_frac(1) %>%
  mutate(fold = rep(1:5, length.out = n()))

data1 %>%
  count(fold)
| fold | n |
|---|---|
| 1 | 600 |
| 2 | 600 |
| 3 | 600 |
| 4 | 600 |
| 5 | 600 |
# Consider a range of degrees of freedom. This range was chosen through trial and error.
df_vector <- 2:60

# Save results here
results <- tibble(
  df = df_vector,
  RMSE = 0
)
for(i in 1:length(df_vector)){
  # Save this df's five per-fold RMSEs here
  RMSE_per_fold <- rep(0, 5)

  for(j in 1:5){
    # Set up training & validation (AKA "pretend" test) sets based on folds:
    # train on the four folds that are not j, validate on fold j
    train_set <- data1 %>%
      filter(fold != j)
    validation_set <- data1 %>%
      filter(fold == j)

    # Fit model
    model <- smooth.spline(x = train_set$x, y = train_set$y, df = df_vector[i])

    # Get predictions
    predictions <- predict(model, validation_set$x) %>%
      as_tibble() %>%
      pull(y)

    # Compute RMSE and save
    RMSE_per_fold[j] <- validation_set %>%
      mutate(yhat = predictions) %>%
      summarise(RMSE = sqrt(mean((y - yhat)^2))) %>%
      pull(RMSE)
  }

  # Average the five per-fold RMSEs
  results$RMSE[i] <- mean(RMSE_per_fold)
}
What is \(df^*\), the optimal degrees of freedom that yields the lowest estimate \(\widehat{RMSE}\) of the true out-of-sample predictive \(RMSE\)?
| df | RMSE |
|---|---|
| 24 | 15.038 |
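The sanity-check code below uses df_star, so here is one way it could have been defined from the results (a sketch, assuming the results tibble from the 5-fold loop above):

# Save the optimal degrees of freedom for reuse below
df_star <- results %>%
  arrange(RMSE) %>%
  slice(1) %>%
  pull(df)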
Sanity check: Lastly, let’s fit a model using the optimal \(df^*\) on all the given data and plot it. Doesn’t look unreasonable!
fitted_model_1 <- smooth.spline(x = data1$x, y = data1$y, df = df_star) %>%
  augment()
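A minimal plotting sketch, assuming ggplot2 is loaded; broom’s augment() returns the columns x, y, and .fitted for smooth.spline fits:

# Scatterplot of the raw data with the fitted spline overlaid in blue
ggplot(fitted_model_1, aes(x = x)) +
  geom_point(aes(y = y), alpha = 0.2) +
  geom_line(aes(y = .fitted), col = "blue")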
Here are the two unknowns revealed!
Let’s now compare the above plot with \(\widehat{f}(x)\), the fitted curve in blue below.
What is \(df^*\), the optimal degrees of freedom that yields the lowest estimate \(\widehat{RMSE}\) of the true out-of-sample predictive \(RMSE\)?
| df | RMSE |
|---|---|
| 17 | 25.375 |
Sanity check: Lastly, let’s fit a model using the optimal \(df^*\) on all the given data and plot it. Doesn’t look unreasonable!
fitted_model_2 <- smooth.spline(x = data2$x, y = data2$y, df = df_star) %>%
  augment()
Here are the two unknowns revealed!
Let’s now compare the above plot with \(\widehat{f}(x)\), the fitted curve in blue below.
Let’s compare everything at once:
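One way to generate such a side-by-side comparison (a sketch, assuming the two augmented fits fitted_model_1 and fitted_model_2 from above):

# Stack both sets of fitted values and facet by dataset
bind_rows(
  fitted_model_1 %>% mutate(dataset = "data1"),
  fitted_model_2 %>% mutate(dataset = "data2")
) %>%
  ggplot(aes(x = x)) +
  geom_point(aes(y = y), alpha = 0.2) +
  geom_line(aes(y = .fitted), col = "blue") +
  facet_wrap(~dataset)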
The \(f(x)\) for data1 and data2 was the same, but there was more noise in data2: the error term \(\epsilon\) had standard deviation \(\sigma = 15\) vs \(\sigma = 25\). You may think that the appropriate estimate \(\widehat{\sigma}\) of \(\sigma\) is the RMSE. Very close! However, not quite. We’ll see when we cover the bias-variance tradeoff that the following holds:
\[ \mbox{MSE} = \left(\mbox{Bias}\left[\widehat{f}(x)\right]\right)^2 + \mbox{Var}\left[\widehat{f}(x)\right] + \sigma^2 \] where \(\sigma^2\) is the variance of the error term \(\epsilon\). Stay tuned!
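Here is a quick simulation sketch of this point, using a hypothetical toy \(f(x)\) with known \(\sigma = 15\): the out-of-sample RMSE lands slightly above \(\sigma\) because it also reflects the bias and variance of \(\widehat{f}(x)\).

set.seed(76)
sigma <- 15

# Toy data with a known f(x) = 20*sin(x) and known noise level sigma
toy <- tibble(
  x = runif(3000, 0, 10),
  y = 20 * sin(x) + rnorm(3000, sd = sigma)
)
train <- toy %>% slice(1:1500)
test <- toy %>% slice(1501:3000)

# Fit on the training half, evaluate on the held-out half
model <- smooth.spline(x = train$x, y = train$y, df = 10)
test %>%
  mutate(yhat = predict(model, x)$y) %>%
  summarise(RMSE = sqrt(mean((y - yhat)^2)))
# The RMSE sits a bit above sigma = 15, since MSE = Bias^2 + Var + sigma^2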
The optimal degrees of freedom \(df^*\) here is 19. For a semi-arbitrarily chosen “tolerance” of 0.05 RMSE units, the simplest model that yields similar performance is the one with \(df = 11\), since in the case of splines, “simpler” means fewer degrees of freedom.
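In code, this “simplest model within tolerance” rule could look like the following sketch (the 0.05 tolerance is the semi-arbitrary choice from above):

tolerance <- 0.05

# Among all df whose RMSE is within tolerance of the minimum, pick the smallest
results %>%
  filter(RMSE <= min(RMSE) + tolerance) %>%
  arrange(df) %>%
  slice(1)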
Not quite equipped to answer this question yet (see ISLR page 183, section 5.1.4).