Question

For both data1 and data2 tibbles (a tibble is a data frame with some metadata attached):

  • Find the splines model with the best out-of-sample predictive ability.
  • Create a visualization arguing why you chose this particular model.
  • Create a visualization of this model plotted over the given \((x_i, y_i)\) points for \(i=1,\ldots,n=3000\).
  • Give your estimate \(\widehat{\sigma}\) of \(\sigma\) where the noise component \(\epsilon_i\) is distributed with mean 0 and standard deviation \(\sigma\).

Note on pull()

pull() is the dplyr way to extract a column from a data frame as a vector. If your data frame has only one row, then pull() returns a scalar (a length-one vector). Note that the following two commands do the same thing; I use whichever is more convenient at the moment.

mtcars %>% pull(mpg)
mtcars$mpg
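
To illustrate the one-row case, here is a small sketch, again using mtcars:

# pull() on a one-row data frame returns a length-one vector, i.e. a scalar
mtcars %>% 
  slice(1) %>% 
  pull(mpg)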

Data 1

Simpler approach

The simplest solution is to run \(k=2\) fold crossvalidation as done in ChalkTalk 1.7:

# Consider a range of degrees of freedom. I chose this through trial and error.
df_vector <- 2:60

# Save results here
results <- tibble(
  df = df_vector,
  RMSE = 0
)

for(i in 1:length(df_vector)){
  # Note: these two datasets are disjoint
  data1_A <- data1 %>% 
    sample_frac(.5)
  data1_B <- data1 %>%
    anti_join(data1_A, by="ID")
  
  # Fit model onto A, get predictions for B, compute RMSE for B
  model <- smooth.spline(x=data1_A$x, y=data1_A$y, df=df_vector[i])
  predictions <- predict(model, data1_B$x) %>% 
    as_tibble() %>% 
    pull(y)
  RMSE_B <- data1_B %>% 
    mutate(yhat = predictions) %>% 
    summarise(RMSE = sqrt(mean((y-yhat)^2))) %>% 
    pull(RMSE)
  
  # Fit model onto B, get predictions for A, compute RMSE for A
  model <- smooth.spline(x=data1_B$x, y=data1_B$y, df=df_vector[i])
  predictions <- predict(model, data1_A$x) %>% 
    as_tibble() %>% 
    pull(y)
  RMSE_A <- data1_A %>% 
    mutate(yhat = predictions) %>% 
    summarise(RMSE = sqrt(mean((y-yhat)^2))) %>% 
    pull(RMSE)
  
  # Take mean
  results$RMSE[i] <- (RMSE_A + RMSE_B)/2
}

What is \(df^*\), the optimal degrees of freedom that yields the lowest estimate \(\widehat{RMSE}\) of the true out-of-sample predictive \(RMSE\)?

df RMSE
24 15.048
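
For reference, here's one way to extract that row (a small sketch using dplyr verbs on the results tibble built above):

# The df with the lowest estimated RMSE
results %>% 
  arrange(RMSE) %>% 
  slice(1)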

5-fold crossvalidation approach

Let’s now generalize this idea to 5-fold crossvalidation. Carefully note the folding scheme; recall the code exercise at the end of Lecture 2.4.

# Set up folding scheme. Note how we first shuffle the rows and then have
# roughly equal numbers of observations in each fold. This scheme ensures
# the folds are disjoint.
data1 <- data1 %>% 
  sample_frac(1) %>% 
  mutate(fold = rep(1:5, length=n()))
data1 %>% 
  count(fold)
fold n
1 600
2 600
3 600
4 600
5 600
# Consider a range of degrees of freedom. This was chosen by trial and error.
df_vector <- 2:60

# Save results here
results <- tibble(
  df = df_vector,
  RMSE = 0
)

for(i in 1:length(df_vector)){
  # Store the five per-fold RMSEs for this df
  RMSE_per_fold <- rep(0, 5)
  
  for(j in 1:5){
    # Set up training & validation (AKA "pretend" test) sets based on folds:
    # train on the four folds != j, validate on the held-out fold j
    train_set <- data1 %>% 
      filter(fold != j)
    validation_set <- data1 %>% 
      filter(fold == j)
    
    # Fit model
    model <- smooth.spline(x=train_set$x, y=train_set$y, df=df_vector[i])
    
    # Get predictions
    predictions <- predict(model, validation_set$x) %>% 
      as_tibble() %>% 
      pull(y)
    
    # Compute RMSE for this fold and save
    RMSE_per_fold[j] <- validation_set %>% 
      mutate(yhat = predictions) %>% 
      summarise(RMSE = sqrt(mean((y-yhat)^2))) %>% 
      pull(RMSE)
  }
  
  # Average over the five folds
  results$RMSE[i] <- mean(RMSE_per_fold)
}

What is \(df^*\), the optimal degrees of freedom that yields the lowest estimate \(\widehat{RMSE}\) of the true out-of-sample predictive \(RMSE\)?

df RMSE
24 15.038

Sanity check: Lastly, let’s fit a model using the optimal \(df^*\) on all the given data and plot it. Doesn’t look unreasonable!

df_star <- 24  # the optimal df* from the crossvalidation above
fitted_model_1 <- smooth.spline(x=data1$x, y=data1$y, df=df_star) %>% 
  broom::augment()
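
The plot itself isn't rendered here, but here is a minimal sketch of how one might draw it, assuming ggplot2 is loaded and using the x, y, and .fitted columns that broom::augment() returns for smooth.spline fits:

# Raw points in the background, fitted spline curve on top
ggplot(fitted_model_1, aes(x = x)) +
  geom_point(aes(y = y), alpha = 0.1) +
  geom_line(aes(y = .fitted), col = "blue", size = 1)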

Truth vs Fitted

Here are the two unknowns revealed!

  1. \(f(x)\) is the curve in red below
  2. \(\epsilon\), the error, was distributed Normal\((0, \sigma = 15)\). The black dashed lines indicate a 95% “error band” about \(f(x)\) i.e. \(f(x) \pm 1.96 \times \sigma\). Note this error band holds because the error was set to be Normal, which is not necessarily always the case!

Let’s now compare the above plot with \(\widehat{f}(x)\), the fitted curve in blue below.

Data 2

5-fold crossvalidation approach

What is \(df^*\), the optimal degrees of freedom that yields the lowest estimate \(\widehat{RMSE}\) of the true out-of-sample predictive \(RMSE\)?

df RMSE
17 25.375

Sanity check: Lastly, let’s fit a model using the optimal \(df^*\) on all the given data and plot it. Doesn’t look unreasonable!

df_star <- 17  # the optimal df* from the crossvalidation above
fitted_model_2 <- smooth.spline(x=data2$x, y=data2$y, df=df_star) %>% 
  broom::augment()
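
The same plotting sketch as for data1 applies here, swapping in fitted_model_2:

ggplot(fitted_model_2, aes(x = x)) +
  geom_point(aes(y = y), alpha = 0.1) +
  geom_line(aes(y = .fitted), col = "blue", size = 1)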

Truth vs Fitted

Here are the two unknowns revealed!

  1. \(f(x)\) is the curve in red below. The same as for data1.
  2. \(\epsilon\), the error, was distributed Normal\((0, \sigma = 25)\). The black dashed lines indicate a 95% “error band” about \(f(x)\) i.e. \(f(x) \pm 1.96 \times \sigma\). Note this error band holds because the error was set to be Normal, which is not necessarily always the case!

Let’s now compare the above plot with \(\widehat{f}(x)\), the fitted curve in blue below.

Comparison

Let’s compare everything at once:

  • Recall
    • The true curve for both data1 and data2 was the same, but there was more noise for data2: \(\epsilon\) had standard deviation \(\sigma = 15\) vs \(\sigma = 25\).
    • This larger \(\sigma\) led to RMSEs of 15.038 vs 25.375.
  • Moral: When there is more noise (in this case when \(\epsilon\) has a bigger standard deviation) for the same signal (the red curve), models will have a harder time “getting it right.”

Additionally…

Estimate \(\widehat{\sigma}\) of \(\sigma\)

You may think that the appropriate estimate \(\widehat{\sigma}\) of \(\sigma\) is the RMSE. Very close! However, not quite. We’ll see when we cover the bias-variance tradeoff that the following holds:

\[ \mbox{MSE} = \left(\mbox{Bias}\left[\widehat{f}(x)\right]\right)^2 + \mbox{Var}\left[\widehat{f}(x)\right] + \sigma^2 \] where \(\sigma^2\) is the variance of the error term \(\epsilon\). Stay tuned!
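
For a rough sense of scale using the data1 numbers above (treating \(\widehat{RMSE}^2\) as an estimate of the MSE, which is an assumption on my part): with \(\widehat{RMSE} = 15.038\) and true \(\sigma = 15\),

\[ \mbox{Bias}^2 + \mbox{Var} \approx \widehat{RMSE}^2 - \sigma^2 = 15.038^2 - 15^2 \approx 1.1 \]

so the RMSE only slightly overestimates \(\sigma\), with the small excess attributable to the bias and variance of \(\widehat{f}(x)\).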

How to pick optimal degrees of freedom \(df^*\)

The optimal degrees of freedom \(df^*\) here is 19. However, since in the case of splines “simpler” means fewer degrees of freedom, for a semi-arbitrarily chosen “tolerance” of 0.05 RMSE units, the simplest model that yields similar performance is \(df=11\). One way to code up this rule is sketched below.
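
A minimal sketch of that rule, assuming a results tibble with df and RMSE columns as in the loops above:

# Among all models within "tolerance" RMSE units of the best one,
# pick the simplest, i.e. the one with the fewest degrees of freedom
tolerance <- 0.05
results %>% 
  filter(RMSE <= min(RMSE) + tolerance) %>% 
  arrange(df) %>% 
  slice(1)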

How to pick number of folds \(k\) for crossvalidation

We’re not quite equipped to answer this question yet (see ISLR Section 5.1.4, page 183).