1
- According to the author, smoothing splines are very commonly used.
Based on what the paper says:
- For what orders are they defined? Smoothing splines are only defined for odd polynomial orders k, since their penalty involves the derivative of order \((k+1)/2\), which must be an integer. The paper states: “Unlike trend filtering, smoothing splines are only defined for an odd polynomial order k.”
- Which is the most common value for k? The most common value for k
used with smoothing splines is k=3, corresponding to cubic smoothing
splines. The paper states “In practice, it seems that the case k = 3
(i.e., cubic smoothing splines) is by far the most common case
considered.”
- What type of penalty do smoothing splines use? Have you seen this in another method in this class? Smoothing splines use a squared L2 penalty on the derivative of order \((k+1)/2\) of the fitted function \(f\). This is the same flavor of penalty as in ridge regression, which penalizes the squared L2 norm of the coefficient vector (the two objectives are written out side by side after this list).
- In the empirical comparisons, which model performs best? What happens as we increase the degrees of freedom (df)? In the empirical comparisons on simulated data in Section 2.2, trend filtering generally performs best, especially for functions with spatially inhomogeneous smoothness. As the degrees of freedom increase, trend filtering adapts to the local level of smoothness better than smoothing splines do. Even when the smoothing splines are allowed more degrees of freedom, trend filtering still fits the data better in regions of high curvature, while the smoothing splines overfit in the smooth regions. The author attributes trend filtering’s superior performance to its L1 penalty, which allows it to adapt to local smoothness, in contrast to the L2 penalty of smoothing splines (a small illustrative sketch follows this list).
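For reference, here are the two objectives side by side (standard forms, consistent with the paper’s setup). A \(k\)-th order smoothing spline solves
\[\hat{f} = \arg\min_{f} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int \big(f^{((k+1)/2)}(t)\big)^2 \, dt,\]
which is also why \(k\) must be odd, while ridge regression solves
\[\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2.\]
In both cases the penalty is a squared L2 norm: of a derivative of \(f\) in one, of the coefficient vector in the other.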
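To make the L1-versus-L2 contrast concrete, here is a minimal sketch, assuming a doppler-style signal, Gaussian noise, and illustrative tuning values (none of this is the paper’s code or setup, and the tuning parameters are not matched by df as in the paper). It fits a cubic trend filter as a convex program with cvxpy and a cubic smoothing spline with SciPy:

```python
import numpy as np
import cvxpy as cp
from scipy.interpolate import make_smoothing_spline  # requires SciPy >= 1.10

rng = np.random.default_rng(0)
n = 256
x = np.linspace(0.05, 1.0, n)
f_true = np.sqrt(x * (1 - x)) * np.sin(4.0 / x)  # doppler-style test signal
y = f_true + 0.1 * rng.standard_normal(n)

# Cubic trend filtering (k = 3): L1 penalty on discrete 4th-order differences.
k = 3
D = np.diff(np.eye(n), n=k + 1, axis=0)  # (k+1)-st order difference operator
beta = cp.Variable(n)
obj = 0.5 * cp.sum_squares(y - beta) + 1.0 * cp.norm1(D @ beta)
cp.Problem(cp.Minimize(obj)).solve()

# Cubic smoothing spline: squared L2 penalty on the 2nd derivative.
spline = make_smoothing_spline(x, y, lam=1e-6)

print("trend filtering MSE: ", np.mean((beta.value - f_true) ** 2))
print("smoothing spline MSE:", np.mean((spline(x) - f_true) ** 2))
```

The L1 term drives many 4th differences exactly to zero, so the fit is a piecewise cubic whose knots land where the signal demands them; the spline’s L2 term shrinks all curvature uniformly instead.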
2
- The author shows several methods in lasso form.
- Which methods are these? The author shows that both trend filtering
and locally adaptive regression splines can be represented in lasso
form. Specifically, trend filtering is expressed as
\[\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^n} \frac{1}{2} \|y - H\alpha\|_2^2 + \lambda \sum_{j=k+2}^n |\alpha_j|\]
and locally adaptive regression splines as
\[\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^n} \frac{1}{2} \|y - G\theta\|_2^2 + \lambda \sum_{j=k+2}^n |\theta_j|,\]
where \(H\) and \(G\) are the appropriate basis matrices for the two methods.
- What penalty is used in this case? Both lasso problems use an L1 penalty on the coefficients, but only on the coefficients \(\alpha_{k+2}, \ldots, \alpha_n\) and \(\theta_{k+2}, \ldots, \theta_n\), respectively. This is equivalent to placing an L1 penalty on the discrete \((k+1)\)-st derivative of the fitted values, which encourages adaptive knot selection (trend filtering’s original difference-penalized form is recalled at the end of this list).
- Which method performs better in the empirical comparisons in 3.4? In
Section 3.4, the author compares trend filtering and locally adaptive
regression splines empirically on the simulated examples from Section
2.2 (the “hills” and “doppler” examples). He finds that for any fixed
value of the tuning parameter \(\lambda\), the trend filtering and locally
adaptive regression spline fits are practically indistinguishable, even
though the two methods are not formally equivalent for polynomial orders
\(k \geq 2\). Both methods perform similarly well at adaptively fitting these spatially inhomogeneous signals.
- What is observed for small values of \(\lambda\)? The author notes that for very small values of \(\lambda\), slight differences between the trend filtering and locally adaptive regression spline estimates begin to appear, though these are not practically meaningful. In the asymptotic theory of Section 5, the author proves that, for \(\lambda\) of an appropriate order, the two estimates converge to each other, so trend filtering inherits the minimax convergence rate of locally adaptive regression splines; the empirical comparisons suggest that the two methods behave similarly over a wide range of \(\lambda\) values, deviating only slightly for very small \(\lambda\).
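For completeness, the lasso forms above arise from a change of variables. The paper defines trend filtering originally as
\[\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^n} \frac{1}{2} \|y - \beta\|_2^2 + \lambda \|D^{(k+1)} \beta\|_1,\]
where \(D^{(k+1)}\) is the discrete difference operator of order \(k+1\); substituting \(\beta = H\alpha\), with \(H\) the basis matrix from the lasso form, turns the difference penalty into \(\sum_{j=k+2}^n |\alpha_j|\), up to a constant factor.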
3
- The author examines astrophysics data.
- Briefly describe the data used and what the goal of the analysis is.
The data comes from an astrophysics simulation model for quasar spectra.
A quasar spectrum shows the relative flux (brightness) of a quasar as a
function of wavelength. The true spectrum is believed to be spatially
inhomogeneous, with regions of rapid oscillations (absorption lines
called the “Lyman-alpha forest”) as well as smooth regions. The goal is
to estimate the true spectrum from noisy observations at n = 1172
wavelengths.
- Briefly describe the experimental setting (i.e., methods used, parameters, frameworks/packages, tuning, comparison metric). The author compares trend filtering, smoothing splines, and wavelet smoothing. Each method is fit at 146 values of its complexity parameter (degrees of freedom), spanning 4 to 150. Smoothing splines are fit with the smooth.spline function in R, while wavelet smoothing uses the wavethresh package. Locally adaptive regression splines are not included, given their practical indistinguishability from trend filtering. The estimates are compared by their mean squared error in estimating the true spectrum, averaged over 20 simulation replications (a sketch of this protocol appears after the list).
- What method performs best and why, according to the author? Trend
filtering performs the best, achieving the lowest average squared error,
especially for lower complexity models. The author attributes this to
trend filtering’s ability to adapt to the local structure of the true
spectrum, fitting the smooth regions while also capturing the
high-frequency oscillations. Smoothing splines perform second best, but
do not adapt as well to the inhomogeneity of the signal. Wavelet
smoothing performs the worst, as it tends to overfit the noisy
Lyman-alpha forest. Even when comparing trend filtering to adaptively
chosen smoothing splines (different penalty parameters in smooth vs
wiggly regions), trend filtering performs just as well or better. The
strong performance of trend filtering is consistent with the theoretical
results showing that it achieves the minimax convergence rate for
spatially inhomogeneous signals.
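As a sketch of the comparison protocol described above (not the paper’s actual code, which runs in R), the evaluation loop might look like the following Python, where fit is a hypothetical wrapper that fits one method at a given degrees-of-freedom value:

```python
import numpy as np

def average_squared_error(fit, noisy_spectra, true_spectrum, df_grid):
    """Mean squared error per df value, averaged over replications.

    fit(y, df) is a hypothetical callable returning an estimate of the
    spectrum from a noisy observation y at complexity df (for example,
    a wrapper around a smoothing spline or trend filtering solver).
    """
    errors = np.zeros(len(df_grid))
    for y in noisy_spectra:               # e.g., 20 simulation replications
        for i, df in enumerate(df_grid):  # e.g., df values from 4 to 150
            errors[i] += np.mean((fit(y, df) - true_spectrum) ** 2)
    return errors / len(noisy_spectra)
```

The method with the lowest error curve over the df grid is the one reported as performing best.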