Question 1: Single-Factor (Market) Model [25 pts]

Model: \(R_i - R_f = \alpha + \beta(R_m - R_f) + \varepsilon\)

Term                          Estimate   Std. Error
Intercept (\(\alpha\))        0.0017     0.0020
Market premium (\(\beta\))    0.98       0.17

\(R^2 = 0.50\), \(n = 96\) months, \(E[R_m - R_f] = 0.70\%\), critical \(|t| \approx 1.98\)


(a) t-statistic for \(\beta\) — Test \(H_0: \beta = 0\)

beta_hat <- 0.98;  se_beta <- 0.17;  t_crit <- 1.98
t_beta <- beta_hat / se_beta
cat("t-statistic for β:", round(t_beta, 4), "\n")
## t-statistic for β: 5.7647
cat("Reject H₀:", abs(t_beta) > t_crit, "\n")
## Reject H₀: TRUE

\[t_{\beta} = \frac{0.98}{0.17} = 5.7647\]

x <- seq(-7, 7, length.out = 500)
df_t <- data.frame(x = x, y = dt(x, df = 94))
ggplot(df_t, aes(x, y)) +
  geom_line(color = "#a78bfa", linewidth = 1) +
  geom_area(data = filter(df_t, x < -t_crit), fill = "#7c3aed", alpha = 0.5) +
  geom_area(data = filter(df_t, x >  t_crit), fill = "#7c3aed", alpha = 0.5) +
  geom_vline(xintercept = t_beta,  color = "#fbbf24", linewidth = 1.2, linetype = "dashed") +
  geom_vline(xintercept = -t_crit, color = "#f472b6", linewidth = 0.8) +
  geom_vline(xintercept =  t_crit, color = "#f472b6", linewidth = 0.8) +
  annotate("label", x = t_beta,  y = 0.25, label = paste0("t = ", round(t_beta,2)),
           fill = "#fbbf24", color = "#000", fontface = "bold") +
  annotate("label", x = -5, y = 0.15, label = "Reject H₀", fill = "#7c3aed", color = "#fff") +
  annotate("label", x =  5, y = 0.15, label = "Reject H₀", fill = "#7c3aed", color = "#fff") +
  labs(title = "t-Distribution (df=94) — Test H₀: β = 0",
       subtitle = "Purple: rejection region  |  Yellow dashed: observed t = 5.76",
       x = "t value", y = "Density") +
  dark_theme

REJECT H₀: β = 0
|t| = 5.7647 > 1.98. β = 0.98 is highly significant — the fund moves almost 1-for-1 with the market.
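As a cross-check on the rounded critical value, the exact two-sided cutoff and p-value can be computed from the t distribution with n − 2 = 94 degrees of freedom. This is a minimal sketch; the names `df_resid`, `t_crit_exact`, and `p_beta` are illustrative and not part of the original solution.

```r
# Exact critical value and two-sided p-value for H0: beta = 0
beta_hat <- 0.98; se_beta <- 0.17
df_resid <- 96 - 2                          # n minus two estimated parameters
t_beta   <- beta_hat / se_beta
t_crit_exact <- qt(0.975, df = df_resid)    # two-sided 5% cutoff
p_beta   <- 2 * pt(-abs(t_beta), df = df_resid)
cat("Exact critical |t|:", round(t_crit_exact, 3), "\n")
cat("p-value below 0.001:", p_beta < 0.001, "\n")
```

The exact cutoff differs from 1.98 only in the third decimal, so the rounded value does not change any decision in this question.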


(b) Test \(H_0: \beta = 1\)

t_beta1 <- (beta_hat - 1) / se_beta
cat("t-statistic for H₀: β = 1:", round(t_beta1, 4))
## t-statistic for H₀: β = 1: -0.1176

\[t = \frac{0.98 - 1}{0.17} = -0.1176\]

FAIL TO REJECT H₀: β = 1
|t| = 0.1176 < 1.98. The fund’s systematic risk is statistically indistinguishable from the market — no evidence of over- or under-exposure relative to a passive index.
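
Parts (a) and (b) can both be read off a single 95% confidence interval built with the same critical value of 1.98:

\[\hat{\beta} \pm 1.98 \times SE(\hat{\beta}) = 0.98 \pm 0.3366 = [0.6434,\; 1.3166]\]

The interval excludes 0 and contains 1, which matches the two test outcomes.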


(c) Jensen’s Alpha — Test \(H_0: \alpha = 0\)

alpha_hat <- 0.0017; se_alpha <- 0.0020
t_alpha <- alpha_hat / se_alpha
cat("t-statistic for α:", round(t_alpha, 4))
## t-statistic for α: 0.85

\[t_{\alpha} = \frac{0.0017}{0.0020} = 0.85\]

FAIL TO REJECT H₀: α = 0
|t| = 0.85 < 1.98. The marketing claim of “positive risk-adjusted performance” is NOT statistically justified. The positive α could easily be sampling noise.


(d) \(R^2\) Decomposition

r2 <- 0.50
df_pie <- data.frame(
  type  = c("Systematic (Market)\nR² = 50%", "Idiosyncratic\n(Diversifiable) 50%"),
  value = c(r2, 1 - r2),
  fill  = c("#7c3aed", "#f472b6")
)
ggplot(df_pie, aes(x = "", y = value, fill = type)) +
  geom_col(width = 1, color = "#0f0f1a", linewidth = 1.5) +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("#7c3aed","#f472b6")) +
  geom_text(aes(label = paste0(value*100, "%")),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 6) +
  labs(title = "R² = 0.50 — Variance Decomposition",
       fill = NULL) +
  dark_theme +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        panel.grid = element_blank())

Interpretation: 50% of the fund’s return variance is driven by market movements (systematic); the remaining 50% is idiosyncratic and diversifiable.
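
The split follows from the single-factor model because the OLS residual is uncorrelated with the market factor. The symbols \(\sigma_i^2\), \(\sigma_m^2\), and \(\sigma_\varepsilon^2\) are notation introduced for this sketch; they denote the variances of the fund's excess return, the market excess return, and the residual:

\[\sigma_i^2 = \beta^2 \sigma_m^2 + \sigma_\varepsilon^2, \qquad R^2 = \frac{\beta^2 \sigma_m^2}{\sigma_i^2} = 0.50\]

In a single-factor regression \(R^2\) is also the squared correlation with the market, so \(\rho = \sqrt{0.50} \approx 0.71\) (positive because \(\hat{\beta} > 0\)).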


(e) CAPM-Implied Expected Return

E_mkt <- 0.0070
CAPM_ret <- beta_hat * E_mkt
cat("CAPM expected monthly excess return:", round(CAPM_ret*100, 4), "%")
## CAPM expected monthly excess return: 0.686 %

\[E[R_i - R_f] = 0.98 \times 0.70\% = 0.686\%\text{ per month}\]


Question 2: Fama–French Three-Factor Model [25 pts]

Model: \(R_i - R_f = \alpha + b \cdot MKT + s \cdot SMB + h \cdot HML + \varepsilon\)

Term          Estimate   Std. Error
\(\alpha\)    0.0029     0.0018
MKT (\(b\))   0.97       0.08
SMB (\(s\))   0.75       0.11
HML (\(h\))   −0.13      0.13

\(R^2 = 0.92\), Adj. \(R^2 = 0.918\), \(n = 144\)


(f) t-statistics for All Coefficients

coefs <- c(alpha=0.0029, MKT=0.97, SMB=0.75, HML=-0.13)
ses   <- c(alpha=0.0018, MKT=0.08, SMB=0.11, HML=0.13)
t_vals <- coefs / ses
sig    <- abs(t_vals) > 1.98
data.frame(Coefficient=names(coefs), Estimate=coefs, SE=ses,
           t_stat=round(t_vals,4), Significant=ifelse(sig,"Yes","No"),
           row.names=NULL)
df_t2 <- data.frame(
  coef  = factor(names(t_vals), levels=names(t_vals)),
  t_abs = abs(t_vals),
  sig   = sig,
  color = c("#f472b6","#60a5fa","#34d399","#fb923c")
)
ggplot(df_t2, aes(x=coef, y=t_abs, fill=color)) +
  geom_col(width=0.55, color="#0f0f1a", linewidth=0.8) +
  geom_hline(yintercept=1.98, color="#fbbf24", linewidth=1.2, linetype="dashed") +
  geom_text(aes(label=round(t_abs,2)), vjust=-0.5, color="white", fontface="bold", size=5) +
  annotate("label", x=0.6, y=2.2, label="Critical |t| = 1.98",
           fill="#fbbf24", color="#000", size=3.5) +
  scale_fill_identity() +
  scale_x_discrete(labels=c("α","MKT (b)","SMB (s)","HML (h)")) +
  labs(title="FF3 Factor t-statistics vs. Critical Value",
       subtitle="Bars above yellow line are significant at 5%",
       x="Coefficient", y="|t-statistic|") +
  dark_theme

Significant factors: MKT (t=12.13) and SMB (t=6.82)
Not significant: α (t=1.61) and HML (t=−1.00)
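
The same conclusions can be stated as two-sided p-values. The sketch below assumes 144 − 4 = 140 residual degrees of freedom (intercept plus three factors); `df_resid` and `p_vals` are illustrative names.

```r
# Two-sided p-values for the four FF3 coefficients
coefs <- c(alpha = 0.0029, MKT = 0.97, SMB = 0.75, HML = -0.13)
ses   <- c(alpha = 0.0018, MKT = 0.08, SMB = 0.11, HML = 0.13)
df_resid <- 144 - 4                         # n minus four estimated parameters
t_vals <- coefs / ses
p_vals <- 2 * pt(-abs(t_vals), df = df_resid)
print(round(p_vals, 4))
print(p_vals < 0.05)                        # TRUE marks significance at 5%
```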


(g) Investment Style Classification

df_style <- data.frame(
  x = c(0, -0.13),   # HML axis
  y = c(0,  0.75),   # SMB axis
  label = c("Market", "This Fund")
)
ggplot() +
  annotate("rect", xmin=-2, xmax=0, ymin=0,  ymax=2,  fill="#1e3a5f", alpha=0.4) +
  annotate("rect", xmin=0,  xmax=2, ymin=0,  ymax=2,  fill="#1e1040", alpha=0.4) +
  annotate("rect", xmin=-2, xmax=0, ymin=-2, ymax=0,  fill="#1c0f00", alpha=0.4) +
  annotate("rect", xmin=0,  xmax=2, ymin=-2, ymax=0,  fill="#052e16", alpha=0.4) +
  annotate("text", x=-1, y=1,  label="Small-Cap\nGrowth",  color="#60a5fa",  size=4.5, fontface="bold") +
  annotate("text", x= 1, y=1,  label="Small-Cap\nValue",   color="#fbbf24",  size=4.5, fontface="bold") +
  annotate("text", x=-1, y=-1, label="Large-Cap\nGrowth",  color="#f472b6",  size=4.5, fontface="bold") +
  annotate("text", x= 1, y=-1, label="Large-Cap\nValue",   color="#34d399",  size=4.5, fontface="bold") +
  geom_hline(yintercept=0, color="#475569", linewidth=0.8) +
  geom_vline(xintercept=0, color="#475569", linewidth=0.8) +
  geom_point(aes(x=0, y=0), color="#94a3b8", size=5) +
  geom_point(aes(x=-0.13, y=0.75), color="#a78bfa", size=8) +
  annotate("label", x=-0.13, y=0.75, label="This Fund\nh=−0.13, s=0.75",
           fill="#7c3aed", color="white", hjust=-0.1, size=3.5) +
  annotate("label", x=0, y=0, label="Market", fill="#334155", color="white", hjust=-0.2, size=3.5) +
  scale_x_continuous(limits=c(-2,2), name="HML Loading (h)  →  Value") +
  scale_y_continuous(limits=c(-2,2), name="SMB Loading (s)  →  Small-Cap") +
  labs(title="Fama-French Style Box — Fund Classification") +
  dark_theme

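Size tilt: s = 0.75 (t = 6.82, significant) → Small-cap bias
Value/Growth tilt: h = −0.13 (t = −1.00, not significant) → Slight growth lean, statistically indistinguishable from neutral
Classification: Small-Cap fund with a weak growth lean (Small-Cap Growth on point estimates only)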
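
How firmly the growth label holds can be checked with a 95% confidence interval for the HML loading, using the critical value of 1.98:

\[\hat{h} \pm 1.98 \times SE(\hat{h}) = -0.13 \pm 0.2574 = [-0.3874,\; 0.1274]\]

The interval spans zero, so the data are consistent with anything from a moderate growth tilt to a mild value tilt. The small-cap tilt is the robust part of the style label.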


(h) Alpha Interpretation

cat("Monthly alpha:", 0.0029, "| t =", round(t_vals["alpha"],4))
## Monthly alpha: 0.0029 | t = 1.6111
cat("\nAnnualized alpha ≈", round(0.0029*12*100,2), "%")
## 
## Annualized alpha ≈ 3.48 %

α = 0.0029/month ≈ 3.48%/year. However, |t| = 1.61 < 1.98 → Not statistically significant. The manager does not demonstrably add value beyond the three factor exposures.


(i) \(R^2\) Comparison: CAPM vs. FF3

df_r2 <- data.frame(
  model = factor(c("CAPM\n(1 Factor)","FF3\n(3 Factors)"), levels=c("CAPM\n(1 Factor)","FF3\n(3 Factors)")),
  r2    = c(0.75, 0.92),
  adj   = c(NA,   0.918),
  fill  = c("#3b82f6","#7c3aed")
)
ggplot(df_r2, aes(x=model, y=r2, fill=fill)) +
  geom_col(width=0.4, color="#0f0f1a", linewidth=1) +
  geom_text(aes(label=paste0("R²=",r2)), vjust=-0.5, color="white", fontface="bold", size=5.5) +
  geom_point(data=filter(df_r2,!is.na(adj)), aes(y=adj), color="#34d399", size=5, shape=18) +
  annotate("label", x=2, y=0.895, label="Adj.R²=0.918", fill="#064e3b", color="#34d399", size=3.5) +
  scale_fill_identity() +
  scale_y_continuous(limits=c(0,1), labels=scales::percent) +
  labs(title="R² Comparison: CAPM vs. Fama-French Three-Factor",
       subtitle="Green diamond = Adjusted R²",
       x="Model", y="R²") +
  dark_theme

The 17pp gain (0.75→0.92) is driven primarily by the SMB factor capturing the fund’s small-cap tilt; HML is insignificant (t = −1.00) and adds little. Adj. R² is used for model comparison because it penalises additional predictors. Here Adj. R² = 0.918 ≈ R², so the penalty for the two extra factors is negligible next to the gain in fit.
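
The reported adjusted \(R^2\) can be reproduced from \(R^2 = 0.92\), \(n = 144\), and \(k = 3\) regressors (the symbol \(k\) is introduced here for the number of factors):

\[\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} = 1 - 0.08 \times \frac{143}{140} \approx 0.918\]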


Question 3: Logistic Regression for Market Direction [25 pts]

Model: \(\text{logit}\,P(\text{Up}) = \beta_0 + \beta_1 r_{t-1} + \beta_2 \Delta VIX_{t-1}\)

\(\beta_0 = -0.02,\; \beta_1 = 5.4,\; \beta_2 = -0.38\). Inputs: \(r_{t-1}=0.010,\;\Delta VIX=1.5\)


(j) Predicted Probability

beta0<- -0.02; beta1<- 5.4; beta2<- -0.38
logit_p <- beta0 + beta1*0.010 + beta2*1.5
prob_up  <- 1/(1+exp(-logit_p))
cat("Logit:", round(logit_p,4), "| P(Up):", round(prob_up,4),
    "| Class:", ifelse(prob_up>=0.5,"Up","Down"))
## Logit: -0.536 | P(Up): 0.3691 | Class: Down
r_seq <- seq(-0.05, 0.05, length.out=200)
df_logit <- data.frame(
  r   = r_seq,
  p   = 1/(1+exp(-(beta0 + beta1*r_seq + beta2*1.5)))
)
ggplot(df_logit, aes(r, p)) +
  geom_line(color="#a78bfa", linewidth=1.5) +
  geom_hline(yintercept=0.5, color="#fbbf24", linetype="dashed", linewidth=0.9) +
  geom_vline(xintercept=0.010, color="#f472b6", linetype="dotted", linewidth=1) +
  geom_point(aes(x=0.010, y=prob_up), color="#f472b6", size=5) +
  annotate("label", x=0.010, y=prob_up, label=paste0("P(Up)=",round(prob_up,4),"\n→ Predict DOWN"),
           fill="#4c0519", color="#f9a8d4", hjust=-0.1, size=3.5) +
  annotate("label", x=-0.045, y=0.53, label="Decision\nboundary 0.5",
           fill="#1c1c00", color="#fbbf24", size=3.2) +
  scale_x_continuous(labels=scales::percent) +
  scale_y_continuous(labels=scales::percent, limits=c(0,1)) +
  labs(title="Logistic Regression — P(Up) vs. Lagged Return",
       subtitle="Holding ΔVIX = 1.5 fixed",
       x="Lagged Return (rₜ₋₁)", y="P(Up)") +
  dark_theme

\[\text{logit} = -0.02 + 5.4(0.01) - 0.38(1.5) = -0.536 \implies P(\text{Up}) = 0.3691\]

Predicted class: DOWN (P = 0.3691 < 0.5)
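
Setting the logit to zero gives the lagged return at which the prediction would flip to Up while \(\Delta VIX\) stays at 1.5 (the symbol \(r^*\) is introduced here for that break-even value):

\[r^* = \frac{-\beta_0 - \beta_2\,\Delta VIX}{\beta_1} = \frac{0.02 + 0.57}{5.4} \approx 0.1093\]

A lagged return of roughly 10.9% would be needed, so with \(\Delta VIX = 1.5\) the model predicts Down over any realistic range of daily returns.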

(k) Economic Interpretation of Coefficients

df_coef <- data.frame(
  var   = c("β₁ (Lagged Return)", "β₂ (ΔVIX)"),
  val   = c(5.4, -0.38),
  color = c("#34d399","#f87171"),
  econ  = c("Momentum: yesterday's\ngain → bullish today", "Fear: rising VIX\n→ bearish today")
)
ggplot(df_coef, aes(x=var, y=val, fill=color)) +
  geom_col(width=0.4, color="#0f0f1a") +
  geom_text(aes(label=paste0(ifelse(val>0,"+",""),val,"\n\n",econ)),
            vjust=ifelse(df_coef$val>0,-0.1,1.2), color="white", size=3.8, fontface="bold") +
  geom_hline(yintercept=0, color="#475569") +
  scale_fill_identity() +
  labs(title="Logistic Regression Coefficients — Economic Meaning",
       x=NULL, y="Log-Odds Coefficient") +
  dark_theme +
  theme(axis.text.x=element_text(size=11, color="#e2e8f0"))

  • β₁ = 5.4 > 0: A positive lagged return raises P(Up). This captures short-term momentum: market continuation over daily horizons.
  • β₂ = −0.38 < 0: A rise in VIX on the previous day lowers P(Up). This captures a fear effect: volatility spikes tend to be followed by down days.
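
Log-odds coefficients are easier to read as odds ratios. The sketch below converts each coefficient for an illustrative change in its predictor; the step sizes (0.01 in lagged return, 1 unit of ΔVIX) and the names `or_ret_1pp` and `or_vix_1` are assumptions made for this example.

```r
# Odds ratios implied by the logistic coefficients
beta1 <- 5.4; beta2 <- -0.38
or_ret_1pp <- exp(beta1 * 0.01)   # lagged return higher by 0.01 (1 pp)
or_vix_1   <- exp(beta2 * 1)      # Delta VIX higher by 1 unit
cat(sprintf("Odds ratio, +1pp lagged return: %.4f\n", or_ret_1pp))
cat(sprintf("Odds ratio, +1 unit Delta VIX:  %.4f\n", or_vix_1))
```

Under these coefficients a one-unit rise in lagged ΔVIX cuts the odds of an Up day by roughly a third, while a 1 pp higher lagged return raises them by about 5 to 6%.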

(l) Confusion Matrix Metrics

TP<-67; FP<-44; FN<-33; TN<-56; N<-200
acc  <- (TP+TN)/N
sens <- TP/(TP+FN)
spec <- TN/(TN+FP)
prec <- TP/(TP+FP)
cat(sprintf("Accuracy:    %.4f (%.1f%%)\n", acc,  acc*100))
## Accuracy:    0.6150 (61.5%)
cat(sprintf("Sensitivity: %.4f (%.1f%%)\n", sens, sens*100))
## Sensitivity: 0.6700 (67.0%)
cat(sprintf("Specificity: %.4f (%.1f%%)\n", spec, spec*100))
## Specificity: 0.5600 (56.0%)
cat(sprintf("Precision:   %.4f (%.1f%%)\n", prec, prec*100))
## Precision:   0.6036 (60.4%)
cm_df <- data.frame(
  Predicted = factor(c("Up","Up","Down","Down"),  levels=c("Down","Up")),
  Actual    = factor(c("Up","Down","Up","Down"),   levels=c("Down","Up")),
  Count     = c(67,44,33,56),
  Type      = c("TP","FP","FN","TN")
)
ggplot(cm_df, aes(Actual, Predicted, fill=Count)) +
  geom_tile(color="#0f0f1a", linewidth=2) +
  geom_text(aes(label=paste0(Type,"\nn=",Count)), color="white", fontface="bold", size=6) +
  scale_fill_gradient(low="#1e3a5f", high="#7c3aed") +
  labs(title="Confusion Matrix — 200-Day Hold-Out Set",
       x="Actual Class", y="Predicted Class") +
  dark_theme +
  theme(legend.position="none",
        axis.text=element_text(size=13, color="#e2e8f0"))

df_metrics <- data.frame(
  metric = c("Accuracy","Sensitivity\n(Recall)","Specificity","Precision"),
  value  = c(acc, sens, spec, prec),
  fill   = c("#7c3aed","#34d399","#60a5fa","#fb923c")
)
ggplot(df_metrics, aes(metric, value, fill=fill)) +
  geom_col(width=0.5, color="#0f0f1a") +
  geom_text(aes(label=paste0(round(value*100,1),"%")), vjust=-0.5, color="white", fontface="bold", size=5) +
  geom_hline(yintercept=0.5, color="#fbbf24", linetype="dashed") +
  scale_fill_identity() +
  scale_y_continuous(limits=c(0,1), labels=scales::percent) +
  labs(title="Model Performance Metrics", x=NULL, y="Rate") +
  dark_theme
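
For reference, the four metrics computed above correspond to the following ratios of confusion-matrix cells, evaluated with the counts from this hold-out set:

\[\text{Accuracy} = \frac{TP + TN}{N} = \frac{123}{200}, \quad \text{Sensitivity} = \frac{TP}{TP + FN} = \frac{67}{100}, \quad \text{Specificity} = \frac{TN}{TN + FP} = \frac{56}{100}, \quad \text{Precision} = \frac{TP}{TP + FP} = \frac{67}{111}\]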


(m) Naive Benchmark & Accuracy Limitations

naive_acc <- 100/200
cat("Naive accuracy:", naive_acc, "| Model accuracy:", acc)
## Naive accuracy: 0.5 | Model accuracy: 0.615
cat("\nModel beats naive:", acc > naive_acc)
## 
## Model beats naive: TRUE

\[\text{Naive accuracy} = \frac{100}{200} = 50\% \quad \text{Model: } 61.5\% \implies \text{Model wins}\]

Why accuracy alone is inadequate for trading:
1. Asymmetric costs: A false positive (wrong UP call) → you enter a losing trade; a false negative (missed UP) → opportunity cost. These are not symmetric.
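2. Class imbalance: this hold-out set is balanced (100 Up, 100 Down), but in samples where Up days dominate, an always-Up rule scores high accuracy with no skill.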
3. No accounting for transaction costs or position sizing.
Better criterion: Strategy Sharpe Ratio (risk-adjusted P&L)
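
One way to score the classifier on risk-adjusted P&L is sketched below. It assumes a long/flat rule (long on an Up call, flat otherwise), a proportional cost charged whenever the position changes, and daily data annualized with 252 periods. The function `strategy_sharpe`, the cost level, and the four sample returns are all hypothetical.

```r
# Sketch: turn Up/Down calls into a long/flat P&L series and score it
# by Sharpe ratio. All inputs below are made-up illustrations.
strategy_sharpe <- function(pred_up, ret, cost = 0, periods_per_year = 252) {
  stopifnot(length(pred_up) == length(ret))
  position <- as.numeric(pred_up)            # 1 = long, 0 = flat
  turnover <- abs(diff(c(0, position)))      # 1 whenever the position changes
  pnl      <- position * ret - cost * turnover
  list(pnl = pnl,
       sharpe = mean(pnl) / sd(pnl) * sqrt(periods_per_year))
}

pred_up <- c(TRUE, FALSE, TRUE, TRUE)        # hypothetical model calls
ret     <- c(0.01, -0.02, 0.03, -0.01)       # hypothetical daily returns
res <- strategy_sharpe(pred_up, ret, cost = 0.0005)
print(res$pnl)
```

Unlike accuracy, this criterion weights each call by the size of the return it captures or misses and charges for turnover.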


Question 4: Resampling & Regularization [25 pts]

\(\bar{r} = 0.70\%/\text{month}\), \(\hat{\sigma} = 5.50\%/\text{month}\), \(n = 48\) months


(n) Sharpe Ratio — Monthly and Annualized

mu<-0.007; sigma<-0.055; n<-48
SR_m <- mu/sigma
SR_a <- SR_m * sqrt(12)
cat(sprintf("Monthly SR: %.4f | Annualized SR: %.4f (×√12=%.4f)", SR_m, SR_a, sqrt(12)))
## Monthly SR: 0.1273 | Annualized SR: 0.4409 (×√12=3.4641)

\[SR_{\text{monthly}} = \frac{0.007}{0.055} = 0.1273 \qquad SR_{\text{annual}} = 0.1273 \times \sqrt{12} = 0.4409\]

freqs  <- c("Daily\n(×√252)","Weekly\n(×√52)","Monthly\n(×√12)","Quarterly\n(×√4)")
scales <- sqrt(c(252,52,12,4))
# Hypothetical comparison: what a per-period SR of 0.1273 would annualize to at each frequency
SR_ann <- SR_m * scales
df_sr  <- data.frame(freq=factor(freqs,levels=freqs), sr=SR_ann,
                     hi=ifelse(freqs=="Monthly\n(×√12)", TRUE, FALSE))
ggplot(df_sr, aes(freq, sr, fill=hi)) +
  geom_col(width=0.5, color="#0f0f1a") +
  geom_text(aes(label=round(sr,3)), vjust=-0.5, color="white", fontface="bold", size=5) +
  scale_fill_manual(values=c("#1e3a5f","#7c3aed")) +
  geom_hline(yintercept=1, color="#fbbf24", linetype="dashed") +
  annotate("label", x=0.6, y=1.05, label="SR=1 benchmark", fill="#1c1c00", color="#fbbf24", size=3) +
  labs(title="Annualized Sharpe Ratio by Return Frequency",
       subtitle="Purple bar = our monthly data",
       x="Observation Frequency", y="Annualized SR") +
  dark_theme + theme(legend.position="none")

Scaling factor: \(\sqrt{12}\) — mean scales by 12, std dev by \(\sqrt{12}\), so SR scales by \(\sqrt{12}\).
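
The scaling rule rests on monthly returns being i.i.d. (or at least serially uncorrelated), so that means and variances both add across the 12 months. With \(\mu\) and \(\sigma\) denoting the monthly mean and standard deviation:

\[SR_{\text{annual}} = \frac{12\mu}{\sqrt{12}\,\sigma} = \sqrt{12}\,\frac{\mu}{\sigma} = \sqrt{12}\; SR_{\text{monthly}}\]

With positively autocorrelated returns the annual standard deviation exceeds \(\sqrt{12}\,\sigma\), and the rule overstates the annualized ratio.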


(o) Bootstrap Standard Error for Sharpe Ratio

set.seed(42)
# Simulated stand-in for the 48 monthly returns, drawn to match the stated mean and sd
r_sim <- rnorm(48, mean=0.007, sd=0.055)
B <- 10000
SR_boot <- replicate(B, {
  rb <- sample(r_sim, 48, replace=TRUE)
  mean(rb)/sd(rb)
})
SE_boot <- sd(SR_boot)
ci <- quantile(SR_boot, c(0.025,0.975))
cat(sprintf("Bootstrap SE: %.4f\n95%% CI: [%.4f, %.4f]", SE_boot, ci[1], ci[2]))
## Bootstrap SE: 0.1490
## 95% CI: [-0.2097, 0.3765]
df_boot <- data.frame(sr=SR_boot)
ggplot(df_boot, aes(sr)) +
  geom_histogram(bins=60, fill="#7c3aed", color="#0f0f1a", alpha=0.85) +
  geom_vline(xintercept=SR_m,  color="#fbbf24", linewidth=1.2, linetype="dashed") +
  geom_vline(xintercept=ci[1], color="#f472b6", linewidth=1, linetype="dotted") +
  geom_vline(xintercept=ci[2], color="#f472b6", linewidth=1, linetype="dotted") +
  annotate("label", x=SR_m,   y=400, label=paste0("SR=",round(SR_m,4)), fill="#fbbf24", color="#000") +
  annotate("label", x=ci[1],  y=300, label=paste0("2.5%: ",round(ci[1],3)), fill="#4c0519",color="#f9a8d4",size=3) +
  annotate("label", x=ci[2],  y=300, label=paste0("97.5%: ",round(ci[2],3)),fill="#4c0519",color="#f9a8d4",size=3) +
  labs(title=paste0("Bootstrap Distribution of Monthly SR  (B=",B,")"),
       subtitle=paste0("SE = ",round(SE_boot,4),"  |  95% CI: [",round(ci[1],3),", ",round(ci[2],3),"]"),
       x="Bootstrap SR", y="Count") +
  dark_theme
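
As a rough plausibility check on the bootstrap figure, a standard large-sample approximation for the standard error of a Sharpe ratio under i.i.d. normal returns can be evaluated at the sample values. This approximation is brought in here as an assumption; it is not part of the original solution:

\[SE(\widehat{SR}) \approx \sqrt{\frac{1 + \tfrac{1}{2}SR^2}{n}} = \sqrt{\frac{1 + 0.5 \times 0.1273^2}{48}} \approx 0.145\]

This is close to the bootstrap standard error of 0.1490 reported above.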

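Why i.i.d. bootstrap is inappropriate: Monthly returns exhibit autocorrelation and volatility clustering (GARCH effects). Resampling individual months with replacement destroys this temporal dependence, so the bootstrap standard error is typically understated when the dependence is positive.

Fix: Stationary Block Bootstrap. Resample contiguous blocks whose lengths are random (geometric with mean l), preserving serial correlation within each block. Using a fixed block length l instead gives the moving block bootstrap.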
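
A minimal base-R sketch of a stationary bootstrap is given below. The helper `stationary_boot_index`, the mean block length of 6 months, and B = 2000 are assumptions chosen for illustration, and `r_sim` is the same simulated stand-in series used above rather than real fund returns.

```r
# Stationary bootstrap sketch: block starts are uniform, block lengths
# are geometric with mean `mean_block`, and blocks wrap around the end.
stationary_boot_index <- function(n, mean_block) {
  p   <- 1 / mean_block                  # probability of starting a new block
  idx <- integer(n)
  idx[1] <- sample.int(n, 1)
  for (t in seq_len(n)[-1]) {
    if (runif(1) < p) {
      idx[t] <- sample.int(n, 1)         # jump to a new random start
    } else {
      idx[t] <- idx[t - 1] %% n + 1      # continue the current block
    }
  }
  idx
}

set.seed(42)
r_sim <- rnorm(48, mean = 0.007, sd = 0.055)   # simulated stand-in returns
SR_block <- replicate(2000, {
  rb <- r_sim[stationary_boot_index(length(r_sim), mean_block = 6)]
  mean(rb) / sd(rb)
})
cat("Block-bootstrap SE:", round(sd(SR_block), 4), "\n")
```

Because `r_sim` is itself an i.i.d. simulation, the block and i.i.d. standard errors should be similar here; a gap would only be expected on real, serially dependent returns.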


(p) LASSO \(\lambda\) Selection

set.seed(1)
lambda_grid <- exp(seq(log(0.005), log(0.2), length.out=60))
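# Simulated CV error curve for illustration only (U-shaped around 0.030);
# curvature chosen so that lambda = 0.065 sits at roughly the 1-SE threshold
cv_err <- 0.85 + 0.1*(log(lambda_grid)-log(0.030))^2 + rnorm(60,0,0.005)
cv_se  <- 0.06 + 0.01*abs(log(lambda_grid)-log(0.030))
# Illustrative factor counts: 60 candidates, 14 at lambda_min, 7 at lambda_1se
n_factors <- round(approx(log(c(0.005,0.030,0.065,0.2)), c(60,14,7,0), log(lambda_grid), rule=2)$y)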

df_lasso <- data.frame(lambda=lambda_grid, cv=cv_err, se=cv_se, n=n_factors)
lambda_min <- 0.030; lambda_1se <- 0.065
cv_at_min  <- approx(lambda_grid, cv_err, lambda_min)$y
se_at_min  <- approx(lambda_grid, cv_se,  lambda_min)$y

ggplot(df_lasso, aes(log(lambda), cv)) +
  geom_ribbon(aes(ymin=cv-se, ymax=cv+se), fill="#7c3aed", alpha=0.2) +
  geom_line(color="#a78bfa", linewidth=1.2) +
  geom_hline(yintercept=cv_at_min+se_at_min, color="#fbbf24", linetype="dashed") +
  geom_vline(xintercept=log(lambda_min), color="#34d399", linewidth=1.1) +
  geom_vline(xintercept=log(lambda_1se), color="#fb923c", linewidth=1.1) +
  annotate("label", x=log(lambda_min), y=0.95,
           label="λ_min=0.030\n14 factors", fill="#052e16", color="#34d399", size=3.5) +
  annotate("label", x=log(lambda_1se), y=0.95,
           label="λ_1SE=0.065\n7 factors ", fill="#3d2000", color="#fb923c", size=3.5) +
  labs(title="LASSO Cross-Validation Error vs. log(λ)",
       subtitle="Shaded band = ±1 SE  |  Orange = recommended deployment λ",
       x="log(λ)", y="CV Error") +
  dark_theme

Deploy λ = 0.065 (1-SE rule, 7 factors)
With 60 candidate factors, using λ_min risks overfitting to noise. The 1-SE rule selects the most heavily regularised model whose CV error lies within one standard error of the minimum: it is more parsimonious, cheaper to trade, and tends to be more robust out-of-sample. This guards against the data-snooping bias inherent in large factor searches.
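
The selection rule itself fits in a few lines. The sketch below applies it to a toy CV curve; the helper `pick_lambda_1se` and the four grid values are made up for illustration. The `cv.glmnet` function in the glmnet package reports comparable quantities as `lambda.min` and `lambda.1se`.

```r
# One-standard-error rule on a toy CV curve (values are made up)
pick_lambda_1se <- function(lambda, cvm, cvsd) {
  i_min     <- which.min(cvm)                 # index of the lowest CV error
  threshold <- cvm[i_min] + cvsd[i_min]       # minimum plus one SE
  list(lambda_min = lambda[i_min],
       lambda_1se = max(lambda[cvm <= threshold]))
}

lambda <- c(0.010, 0.030, 0.065, 0.100)
cvm    <- c(1.00, 0.85, 0.90, 1.20)     # mean CV error per lambda
cvsd   <- c(0.06, 0.06, 0.06, 0.06)     # standard error of each CV mean
sel <- pick_lambda_1se(lambda, cvm, cvsd)
print(sel)
```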


(q) Walk-Forward Cross-Validation

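n_months <- 60   # not named T, which would mask TRUE in R
# Expanding-window folds confined to months 1-48, so the final
# hold-out (months 49-60) is never used for model selection
folds <- data.frame(
  fold     = 1:5,
  tr_start = 1,
  tr_end   = c(18,24,30,36,42),
  te_start = c(19,25,31,37,43),
  te_end   = c(24,30,36,42,48)
)
holdout <- data.frame(fold=6, tr_start=1, tr_end=48, te_start=49, te_end=60)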

ggplot() +
  # Training bars
  geom_rect(data=folds, aes(xmin=tr_start, xmax=tr_end, ymin=fold-0.35, ymax=fold+0.35),
            fill="#7c3aed", alpha=0.85, color="#0f0f1a") +
  # Validation bars
  geom_rect(data=folds, aes(xmin=te_start, xmax=te_end, ymin=fold-0.35, ymax=fold+0.35),
            fill="#34d399", alpha=0.85, color="#0f0f1a") +
  # Final test
  geom_rect(data=holdout, aes(xmin=tr_start, xmax=tr_end, ymin=fold-0.35, ymax=fold+0.35),
            fill="#7c3aed", alpha=0.5, color="#0f0f1a") +
  geom_rect(data=holdout, aes(xmin=te_start, xmax=te_end, ymin=fold-0.35, ymax=fold+0.35),
            fill="#f472b6", alpha=0.85, color="#0f0f1a") +
  # Labels
  geom_text(data=folds, aes(x=(tr_start+tr_end)/2, y=fold, label="TRAIN"), color="white", fontface="bold", size=3.5) +
  geom_text(data=folds, aes(x=(te_start+te_end)/2, y=fold, label="VAL"),   color="#0f0f1a", fontface="bold", size=3.5) +
  geom_text(data=holdout, aes(x=(tr_start+tr_end)/2, y=fold, label="FULL TRAIN"), color="white", fontface="bold", size=3) +
  geom_text(data=holdout, aes(x=(te_start+te_end)/2, y=fold, label="FINAL TEST"), color="white", fontface="bold", size=3) +
  geom_text(data=rbind(folds,holdout), aes(x=-1, y=fold, label=ifelse(fold<6,paste("Fold",fold),"Holdout")),
            hjust=1, color="#cbd5e1", size=3.5) +
  scale_x_continuous(limits=c(-4,62), name="Month", breaks=seq(0,60,12)) +
  scale_y_continuous(name=NULL, breaks=NULL) +
  labs(title="Walk-Forward (Time-Respecting) Cross-Validation",
       subtitle="Purple=Train | Green=Validation | Pink=Final Hold-Out Test") +
  dark_theme +
  theme(axis.text.y=element_blank(), panel.grid=element_blank())

Why random k-fold is unsafe for time series: Random shuffling allows training folds to contain future data relative to the validation fold — look-ahead bias. This produces artificially optimistic Sharpe ratios that vanish in live trading.

Walk-forward respects the arrow of time: training always uses only past data, validation always uses future data — mirroring real deployment.
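
Fold boundaries of this kind can be generated instead of typed by hand. The sketch below builds expanding-window folds on a development sample and assumes the final hold-out months are kept out of every validation window; `walk_forward_folds` and its settings (48 development months, an initial window of 18, a 6-month validation horizon) are hypothetical.

```r
# Expanding-window walk-forward folds on a development sample.
# n_dev, initial, and horizon are hypothetical settings.
walk_forward_folds <- function(n_dev, initial, horizon) {
  tr_end <- seq(initial, n_dev - horizon, by = horizon)
  data.frame(fold     = seq_along(tr_end),
             tr_start = 1,
             tr_end   = tr_end,
             te_start = tr_end + 1,
             te_end   = tr_end + horizon)
}

folds_wf <- walk_forward_folds(n_dev = 48, initial = 18, horizon = 6)
print(folds_wf)
# Every validation window starts after its training window ends,
# and none extends past the development sample.
stopifnot(all(folds_wf$te_start > folds_wf$tr_end),
          max(folds_wf$te_end) <= 48)
```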


End of Examination