Assignment 1 - Basics of Hyperparameter Tuning: Finding the Optimal Span & Degree for LOESS

Author

Gavin Shklanka

library(tidyverse)
library(knitr)
library(kableExtra)

knitr::opts_chunk$set(
  fig.width = 8,
  fig.height = 4.8,
  dpi = 150
)

pretty_kable <- function(x, caption = NULL, digits = 3) {
  x %>%
    mutate(across(where(is.numeric), ~ round(.x, digits))) %>%
    kable(caption = caption, format = "html", align = "l") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed", "responsive"),
      full_width = FALSE,
      position = "left"
    )
}

rmspe <- function(y_true, y_pred) {
  ok <- is.finite(y_true) & is.finite(y_pred)
  sqrt(mean((y_true[ok] - y_pred[ok])^2))
}

theme_set(theme_minimal(base_size = 13))

a. Setup & Exploration

set.seed(5560)

n <- 500
x <- sort(runif(n, min = 0, max = 30))
f_true <- 50 + 15*x - 0.3*x^2 + 30*sin(x/3) + 10*cos(x)
y <- f_true + rnorm(n, mean = 0, sd = 15)

dat <- tibble(
  y = y,
  x = x,
  f_true = f_true
)

print("DF CREATED")

[1] "DF CREATED"

print(class(df))

[1] "function"

print(nrow(df))

NULL

Scatter Plot Showcasing True Function

ggplot(dat, aes(x, y)) +
  geom_point(alpha = 0.35) +
  geom_line(aes(y = f_true),
            color = "yellow",
            linetype = "dashed",
            linewidth = 1) +
  labs(title = "Simulated Data with True Function",
       x = "x",
       y = "y")

Question 1: What does the function reveal?

Response: It reveals the structure of the model. Specifically the non-linearity of it. With both the global curvature and the local oscillations. the figure shows that the signal bends and fluctuates, rather than linearly. Infering that a linear model cannot accommodate the complexity of the dimensions, nor the variation.

A linear model cannot accommodate this structure. It would impose a rigid form on a flexible signal.

Flexible models outperform rigid ones when underlying signal is non-linear
Flexibility must be controlled - Accounting for variance VIA tuning.

These principles generelize past LOESS, you’ll find similarities within tree depth in random forests, the learnign rate in neural netwroks, architecture depth & size in deep learning. Flexibility must be controlled, while accounting for variance via tuning.

1.B Effect of Span Parameter

spans_demo <- c(0.2, 0.5, 0.8)

pred_demo <- purrr::map_dfr(spans_demo, function(s) {
  m <- loess(y ~ x, data = dat, span = s, degree = 1)
  tibble(
    x = dat$x,
    y_hat = predict(m, newdata = data.frame(x = dat$x)),
    span = paste0("span=", s)
  )
})

ggplot(dat, aes(x, y)) +
  geom_point(alpha = 0.25) +
  geom_line(aes(y = f_true),
            color = "red",
            linetype = "dashed",
            linewidth = 1) +
  geom_line(data = pred_demo,
            aes(x = x, y = y_hat, color = span),
            linewidth = 1) +
  labs(title = "LOESS Fits for Different Span Values (degree = 1)",
       x = "x",
       y = "y",
       color = "Span")

Question 2 Span’s affect towards LOESS fit

Response: Span controls the neighborhood size used for local regression. From the table the lowest RMSPE occurs at span of 0.15 with RMSPE 13.978. At span =0.90, RMSPE increases dramatically to 24.223.

Small Span: Highly adaptive, captures local oscillations, low bias, higher variance

Large Span: Smooth and Stable, Misses local fluctuations, higher bias, lower variance.

Relationship is clear. RMSPE shows U-Shape. Large Span oversmooths the signal. Smaller spans track the oscillation structure more closely

Showcasing the bias-variance trade off in the span selection.

Question 3 - What is the implication of the Bias-Variance?

Response: Span functions as a guage of complexity. What I mean by that, is that looking at the characteristics, reducing span decreases bias, but raises variances. Vice versa to showcase the later. In reference to the pedagogie of ML as a whole, this tradeoff is universal.

Model Complexity must be tuned to balance bias and variance

Question 4 - But why not choose the smallest span?

Response:

Span functions as a gauge of complexity. From the relationship above, reducing span decreases bias but increases variance. Increasing span increases bias but reduces variance.

The nested CV results reinforce this. Span = 0.15 was selected consistently across all 10 outer iterations. The mean RMSPE was 15.374 with a relatively low SD of 0.604. This indicates that the selected complexity level generalizes consistently across partitions.

Model complexity must be tuned to balance bias and variance. Tuning is not deterministic. Hyperparameter selection depends on data splits rather than simply choosing the smallest span. It is a gauge of complexity. You dial it with the tradeoffs in mind.

Part 2: Simple Train-Test Split

grid_span <- seq(0.1, 0.9, 0.05)

set.seed(100)
idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

res_b1 <- purrr::map_dfr(grid_span, function(s) {
  m <- loess(y ~ x, data = train, span = s, degree = 1)
  pred <- predict(m, newdata = data.frame(x = test$x))
  tibble(span = s, rmspe = rmspe(test$y, pred))
})

pretty_kable(res_b1, "Train-Test Split (Seed=100): RMSPE by Span")

Train-Test Split (Seed=100): RMSPE by Span
span	rmspe
0.10	14.179
0.15	13.978
0.20	14.073
0.25	14.423
0.30	14.980
0.35	15.519
0.40	16.161
0.45	17.256
0.50	18.398
0.55	19.792
0.60	20.854
0.65	21.381
0.70	22.452
0.75	23.001
0.80	23.498
0.85	23.897
0.90	24.223

ggplot(res_b1, aes(x = span, y = rmspe)) +
  geom_line() +
  geom_point() +
  geom_vline(xintercept = 0.15, linetype = "dashed") +
  labs(title = "RMSPE vs Span (Single Split)",
       x = "Span",
       y = "Test RMSPE")

best_b1 <- res_b1 %>% slice_min(rmspe, with_ties = FALSE)
best_b1

# A tibble: 1 × 2
   span rmspe
  <dbl> <dbl>
1  0.15  14.0

Question 5 - What span is otimal under one split?

Response:

Under the 80/20 split (seed = 100), the optimal span was approximately 0.15 with the lowest RMSPE. the Test RMSPE = 19.978. To note, this is conditional. Across seeds, RMSPE ranged from 14.307 to 17.929. One split does not define the truth, it reflects on characteristic of the partitiion.

This reinforces that moderate flexibility performs well when the signal is nonlinear. The span was not extremely small, nor overly large. It sat in the range where bias and variance were reasonably balanced.

Part 2b. Instability Across Seeds

seeds <- c(200, 300, 400)

instability <- purrr::map_dfr(seeds, function(sd) {
  set.seed(sd)
  idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
  train <- dat[idx, ]
  test  <- dat[-idx, ]

  res <- purrr::map_dfr(grid_span, function(s) {
    m <- loess(y ~ x, data = train, span = s, degree = 1)
    pred <- predict(m, newdata = data.frame(x = test$x))
    tibble(span = s, rmspe = rmspe(test$y, pred))
  })

  best <- res %>% slice_min(rmspe, with_ties = FALSE)

  tibble(seed = sd,
         best_span = best$span,
         best_rmspe = best$rmspe)
})

pretty_kable(instability, "Instability Across Random Splits")

Instability Across Random Splits
seed	best_span	best_rmspe
200	0.15	17.929
300	0.15	15.980
400	0.10	14.307

Question 6-8 - What does instability do across the seed.

Q6. Response:

Changing the seed changed both the selected span and the resulting performance. The “best” span is not fixed. It varies depending on how the data are partitioned. It varies with sampling. This reinforces a central theme, model tuning is sensitive to partitioning.

Looking at the results:

Seed 200 -> RMSPE = 17.929

Seed 300 -> RMSPE = 15.980

Seed 400 -> RMSPE = 14.307

From that range, the wings show 3.622 units as the difference from the highest and lowest error. Infering that a meaningful swing is caused purely by partition.

For hyperparameter selection, it depends on how the dataset is split. A single split does not provide a stable estimate of the optimal complexity. Which is why we apply cross validation.

Q.7. Response:

If hyperparameters change across splits, then our selection process is unstable. Instability implies that we may be tuning to noise rather than signal.

In practical applications, this is risky. It means that what appears optimal under one split may not generalize under another. The variability observed here demonstrates that relying on one partition can produce misleading conclusions.

Q.8 Response:

This motivates cross-validation. Instead of relying on a single arbitrary split, cross-validation rotates the partitions so each observation contributes to both training and validation. Aaveraging the performance across multiple splits reduces the dependence on one partition, allowing to obtain a more reliable estimate of generalization performance.

Part Three: 10-Fold Nested Cross Validation

set.seed(5560)

K <- 10
outer_iters <- 10
grid_span <- seq(0.1, 0.9, 0.05)

outer_results_c1 <- vector("list", outer_iters)

for (o in 1:outer_iters) {

  idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
  modata <- dat[idx, ]
  test   <- dat[-idx, ]

  fold_id <- sample(rep(1:K, length.out = nrow(modata)))

  cv_grid <- purrr::map_dfr(grid_span, function(s) {

    fold_rmspe <- purrr::map_dbl(1:K, function(k) {
      train_k <- modata[fold_id != k, ]
      valid_k <- modata[fold_id == k, ]

      m <- loess(y ~ x, data = train_k, span = s, degree = 1)
      pred <- predict(m, newdata = data.frame(x = valid_k$x))
      rmspe(valid_k$y, pred)
    })

    tibble(span = s, cv_mean = mean(fold_rmspe))
  })

  best <- cv_grid %>% slice_min(cv_mean, with_ties = FALSE)

  m_best <- loess(y ~ x, data = modata, span = best$span, degree = 1)
  pred_test <- predict(m_best, newdata = data.frame(x = test$x))
  test_r <- rmspe(test$y, pred_test)

  outer_results_c1[[o]] <- tibble(
    iteration = o,
    selected_span = best$span,
    test_rmspe = test_r
  )
}

tab_c1 <- bind_rows(outer_results_c1)
pretty_kable(tab_c1, "Nested CV (Tune Span Only)")

Nested CV (Tune Span Only)
iteration	selected_span	test_rmspe
1	0.15	15.543
2	0.15	14.469
3	0.15	16.223
4	0.15	15.330
5	0.15	16.061
6	0.15	15.532
7	0.15	14.802
8	0.15	14.726
9	0.15	15.071
10	0.15	15.985

ggplot(tab_c1, aes(y = test_rmspe)) +
  geom_boxplot() +
  labs(title = "Distribution of Outer Test RMSPE (Nested CV)",
       y = "Test RMSPE")

summary_c1 <- tab_c1 %>%
  summarise(
    mean_test_rmspe = mean(test_rmspe),
    sd_test_rmspe = sd(test_rmspe)
  )
  

ggplot(tab_c1, aes(y = test_rmspe)) +
  geom_boxplot()

pretty_kable(summary_c1, "Nested CV Summary (Span Only)")

Nested CV Summary (Span Only)
mean_test_rmspe	sd_test_rmspe
15.374	0.604

Question 9

Nested cross-validation separates hyperparameter tuning from performance evaluation. The inner loop selects the best span. The outer loop evaluates that selected span on unseen data.

In this case, span = 0.15 was consistently selected across all 10 outer iterations. The mean outer test RMSPE was 15.374 with a relatively low SD of 0.604. This indicates stable generalization performance.

This separation prevents information leakage. Without it, performance estimates are optimistically biased because tuning and evaluation would occur on overlapping information.

Nested CV enforces procedural fairness in model evaluation. It provides a more defensible and stable estimate of true out-of-sample performance.

Question 10 Mean & SD

The mean represents expected out-of-sample performance under nested cross-validation. The standard deviation represents variability across the 10 outer splits.

Mean test RMSPE = 15.374.
SD = 0.604.

An SD of 0.604 is small compared to the mean, indicating stable performance across partitions. The model’s error did not fluctuate dramatically. Low standard deviation indicates stability. High standard deviation would indicate sensitivity to partitioning.

Both matter. Performance without stability is unreliable. Nested CV achieved both controlled error and controlled variability.

Question 11

Nested CV consistently selected span = 0.15 across all 10 outer iterations. This suggests that moderate flexibility generalizes best for this nonlinear signal. A span of 0.15 allows the model to capture oscillations without overfitting the noise.

The consistency of selection across outer splits strengthens confidence in this hyperparameter choice. It was not a one-split accident. More importantly, nested CV provides a more defensible estimate of generalization performance compared to a single train-test split. It confirms that span = 0.15 is supported by repeated evaluation, not just one partition.

Part 3.B Tune Span + Degree

grid_both <- expand.grid(
  span = seq(0.1, 0.9, 0.1),
  degree = c(1, 2)
)

outer_results_c2 <- vector("list", outer_iters)

for (o in 1:outer_iters) {

  idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
  modata <- dat[idx, ]
  test   <- dat[-idx, ]

  fold_id <- sample(rep(1:K, length.out = nrow(modata)))

  cv_grid <- purrr::map_dfr(seq_len(nrow(grid_both)), function(i) {

    s <- grid_both$span[i]
    d <- grid_both$degree[i]

    fold_rmspe <- purrr::map_dbl(1:K, function(k) {
      train_k <- modata[fold_id != k, ]
      valid_k <- modata[fold_id == k, ]

      m <- loess(y ~ x, data = train_k, span = s, degree = d)
      pred <- predict(m, newdata = data.frame(x = valid_k$x))
      rmspe(valid_k$y, pred)
    })

    tibble(span = s, degree = d, cv_mean = mean(fold_rmspe))
  })

  best <- cv_grid %>% slice_min(cv_mean, with_ties = FALSE)

  m_best <- loess(y ~ x, data = modata,
                  span = best$span,
                  degree = best$degree)

  pred_test <- predict(m_best, newdata = data.frame(x = test$x))
  test_r <- rmspe(test$y, pred_test)

  outer_results_c2[[o]] <- tibble(
    iteration = o,
    selected_span = best$span,
    selected_degree = best$degree,
    test_rmspe = test_r
  )
}

tab_c2 <- bind_rows(outer_results_c2)
pretty_kable(tab_c2, "Nested CV (Span + Degree)")

Nested CV (Span + Degree)
iteration	selected_span	selected_degree	test_rmspe
1	0.2	2	16.258
2	0.2	2	14.824
3	0.2	2	15.914
4	0.2	2	14.715
5	0.2	2	16.303
6	0.2	2	15.711
7	0.2	2	16.021
8	0.2	2	15.063
9	0.2	2	13.664
10	0.2	2	14.986

summary_c2 <- tab_c2 %>%
  summarise(
    mean_test_rmspe = mean(test_rmspe, na.rm = TRUE),
    sd_test_rmspe = sd(test_rmspe, na.rm = TRUE)
  )

pretty_kable(summary_c2, "Nested CV Summary (Span + Degree)")

Nested CV Summary (Span + Degree)
mean_test_rmspe	sd_test_rmspe
15.346	0.841

Questions 12-13 Span & Degree tuning

Extending the hyperparameter space to include polynomial degree resulted in:

Mean RMSPE = 15.346
SD = 0.841

Compared to tuning span only (Mean = 15.374, SD = 0.604), the reduction in mean error was 0.028 units, which is negligible.

However, variability increased from 0.604 to 0.841, indicating greater sensitivity to partitioning.

This suggests that adding degree = 2 increased model flexibility, but the additional flexibility did not sufficiently reduce bias to justify the increase in variance. The generalization error remained essentially unchanged, while stability worsened.

Model complexity must be justified by demonstrable improvements in generalized performance. In this case, the marginal gain in mean error did not outweigh the increase in variability. Objectively looking at the numbers.

Part Four: Bootstrap Cross-Validation

outer_boot <- 8
B <- 80

outer_results_d <- vector("list", outer_boot)

for (o in 1:outer_boot) {

  idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
  modata <- dat[idx, ]
  test   <- dat[-idx, ]

  boot_grid <- purrr::map_dfr(seq_len(nrow(grid_both)), function(i) {

    s <- grid_both$span[i]
    d <- grid_both$degree[i]

    boot_rmspe <- purrr::map_dbl(1:B, function(b) {

      boot_idx <- sample(seq_len(nrow(modata)), replace = TRUE)
      train_b <- modata[boot_idx, ]

      oob_idx <- setdiff(seq_len(nrow(modata)), unique(boot_idx))
      if (length(oob_idx) < 10) return(NA_real_)

      valid_b <- modata[oob_idx, ]

      m <- loess(y ~ x, data = train_b, span = s, degree = d)
      pred <- predict(m, newdata = data.frame(x = valid_b$x))
      rmspe(valid_b$y, pred)
    })

    tibble(span = s, degree = d,
           boot_mean = mean(boot_rmspe, na.rm = TRUE))
  })

  best <- boot_grid %>% slice_min(boot_mean, with_ties = FALSE)

  m_best <- loess(y ~ x, data = modata,
                  span = best$span,
                  degree = best$degree)

  pred_test <- predict(m_best, newdata = data.frame(x = test$x))
  test_r <- rmspe(test$y, pred_test)

  outer_results_d[[o]] <- tibble(
    iteration = o,
    selected_span = best$span,
    selected_degree = best$degree,
    test_rmspe = test_r
  )
}

tab_d <- bind_rows(outer_results_d)
pretty_kable(tab_d, "Bootstrap CV Results")

Bootstrap CV Results
iteration	selected_span	selected_degree	test_rmspe
1	0.3	2	17.193
2	0.2	2	15.867
3	0.3	2	14.309
4	0.3	2	15.883
5	0.2	2	15.802
6	0.1	1	15.093
7	0.3	2	15.164
8	0.2	2	15.391

Question 14 - 18: Bootstrap & Cross-Validation

Response:

Bootstrap results show span values between 0.1 and 0.3 and degrees primarily = 2.

Test RMSPE ranged from 14.309 to 17.193. Bootstrapping estimates sampling variability. Unlike k-fold CV, which evaluates partition stability, bootstrap evaluates sampling variability. It provides insight into hyperparameter sensitivity under resampling.

Part 5: Benchmarch Comparison

set.seed(5560)

eval_models <- purrr::map_dfr(1:10, function(o) {

  idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
  train <- dat[idx, ]
  test  <- dat[-idx, ]

  # Linear
  m_lin <- lm(y ~ x, data = train)
  r_lin <- rmspe(test$y, predict(m_lin, test))

  # Polynomial degree 4
  m_poly <- lm(y ~ poly(x, 4, raw = TRUE), data = train)
  r_poly <- rmspe(test$y, predict(m_poly, test))

  # Default LOESS
  m_def <- loess(y ~ x, data = train, span = 0.75, degree = 1)
  r_def <- rmspe(test$y, predict(m_def, data.frame(x = test$x)))

  tibble(
    linear = r_lin,
    poly4 = r_poly,
    loess_default = r_def
  )
})

bench_summary <- tibble(
  Model = c(
    "Linear",
    "Polynomial (Degree 4)",
    "LOESS Default (0.75, Deg 1)"
  ),
  Mean_RMSPE = c(
    mean(eval_models$linear),
    mean(eval_models$poly4),
    mean(eval_models$loess_default)
  ),
  SD_RMSPE = c(
    sd(eval_models$linear),
    sd(eval_models$poly4),
    sd(eval_models$loess_default)
  )
)

tuned_summary <- tibble(
  Model = "LOESS Tuned (Span + Degree)",
  Mean_RMSPE = mean(tab_c2$test_rmspe),
  SD_RMSPE = sd(tab_c2$test_rmspe)
)

final_table <- bind_rows(bench_summary, tuned_summary)

pretty_kable(final_table, "Benchmark Comparison: All Models")

Benchmark Comparison: All Models
Model	Mean_RMSPE	SD_RMSPE
Linear	28.598	1.040
Polynomial (Degree 4)	17.441	0.944
LOESS Default (0.75, Deg 1)	25.763	1.161
LOESS Tuned (Span + Degree)	15.346	0.841

ggplot(final_table, aes(x = Model, y = Mean_RMSPE)) +
  geom_col() +
  labs(title = "Model Comparison: Mean Test RMSPE",
       x = "Model",
       y = "Mean RMSPE") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Question 19 - 21: Benchmark Comparison & Embedded Conclusion

Response:

Linear regression underperformed due to structural rigidity. It imposed a global linear form on a clearly nonlinear signal, resulting in higher test error relative to the flexible alternatives. Polynomial regression improved performance by capturing global curvature.

However, it remained a global model and could not adapt locally to oscillations in the data. Default LOESS (span = 0.75) oversmoothed the signal. The larger span increased bias and failed to track the oscillatory structure effectively.

Tuned LOESS achieved the lowest and most stable performance. Under nested CV, span = 0.15 produced a mean RMSPE of 15.374 with an SD of 0.604, indicating both competitive accuracy and stability. Extending flexibility further (span + degree tuning) reduced mean RMSPE only marginally to 15.346, while increasing variability (SD = 0.841), suggesting diminishing returns from additional complexity.

Overall, flexible models outperform rigid ones when the signal is nonlinear. But flexibility alone is not sufficient. It must be tuned and evaluated properly. Resampling techniques such as nested cross-validation ensure that performance is not only low, but stable. Reliability requires both accuracy and controlled variability.