- We will explore ways to improve the linear model
\[ Y = \beta_0 + \beta_1x_1 + \beta_2x_2+\dots\]
- There are many shortcomings of linear regression:
  - Prediction accuracy is not always the best, especially when \(p\) is large
  - Model interpretability is hard when \(p\) is large, therefore we may prefer a smaller set of important variables
  - Non-linearities: the linear assumption is almost always an approximation, and it can be a very bad approximation
- Note that these methods can also be extended to other machine learning methods
Subset Selection
In all the following methods, we consider the candidate models at each fixed number of predictors and pick the best at that size using a metric that is insensitive to the number of predictors, such as \(RSS\) (or deviance for a GLM). Then at the end we compare the best model of each size using scores that are sensitive to the number of predictors, such as \(AIC\) or cross-validated error, to determine the best model overall.
Best subset selection considers all possible models given the predictors
- Picks the best of all possible models given a criterion, but is very computationally expensive:
\[ \sum^p_{k=0}\left(\begin{matrix}p\\k\end{matrix}\right) = 2^p\text{ models}\]
The algorithm is as follows:
Best Subset Algorithm
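As a concrete illustration, here is a minimal Python sketch of best subset selection (the notes do not specify software; scikit-learn and the names `best_subset`, `X`, `y` are illustrative assumptions): for each size \(k\) it fits every \(\binom{p}{k}\) model and keeps the one with the lowest \(RSS\).

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each size k, fit every C(p, k) model and keep the one with the lowest RSS."""
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(range(p), k):
            cols = list(cols)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best_per_size[k] = (best_cols, best_rss)
    return best_per_size  # compare across sizes with AIC/BIC/CV, not RSS
```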
- Forward selection starts from the null model \(\mathcal{M}_0\) and builds on it one step at a time, picking the best model at each number of predictors
- Only looks forward one step at a time, therefore it may include variables that aren't needed later and/or end with a combination that is not totally optimal
- Can be used on datasets where \(p>n\), fitting submodels with up to \(n-1\) predictors
- This is a guided search, and the models considered are generally good fits, so the computational cost is much lower:
\[ 1+\sum^{p-1}_{k=0}(p-k) = 1+\frac{p(p+1)}{2}\text{ models}\]
- The algorithm is as follows:
Forward Stepwise Algorithm
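The greedy search can be sketched as follows (again a minimal illustration with assumed scikit-learn tooling and illustrative names): at each step, add the predictor that most reduces the \(RSS\).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedily add, at each step, the predictor that most reduces the RSS."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        rss = {}
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss[j] = np.sum((y - fit.predict(X[:, cols])) ** 2)
        best_j = min(rss, key=rss.get)
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), rss[best_j]))
    return path  # one candidate model per size; choose among them with AIC/BIC/CV
```

Backward stepwise selection is the mirror image: start from all predictors and drop the one whose removal increases the \(RSS\) the least.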
- Backward selection is the opposite of forward selection: we start from the full model \(\mathcal{M}_p\) and remove one variable at a time.
- Only looks backward one step at a time, so it has the same limitations as forward selection and considers the same number of models
- However, unlike forward selection, backward selection cannot be used when \(p>n\)
Backward Stepwise Algorithm
- Hybrid selection is a combination of forward and backward selection where variables are added one by one like with forward selection, but variables that no longer provide an improvement to model fit may also be removed.
- Attempts to closely mimic best subset selection while retaining the computational advantages of forward and backward selection.
Determining the Best Model
We need a metric to compare models with different numbers of predictors
- The model with more predictors will always have a lower \(RSS\)/higher \(R^2\)
- We want to get an estimate of the test error, done either through direct (CV) or indirect (mathematical adjustment) methods
- Note these formulas are for linear regression but can be generalised for other models.
Indirect formulas, noting \(d\) is the number of predictors in the model:
- \(C_p\) (Mallows' \(C_p\)) is an unbiased estimate of the test MSE if \(\hat{\sigma}^2\) is an unbiased estimate of \(\sigma^2\)
\[ C_p=\frac{1}{n}(RSS+2d\hat{\sigma}^2) \]
- Akaike Information Criterion (AIC): this is proportional to \(C_p\) for least squares, so it gives the same results
\[ AIC = \frac{1}{n\hat\sigma^2}(RSS+2d\hat{\sigma}^2) \]
- Bayesian Information Criterion (BIC): this applies a higher penalty when there are many predictors \((\log(n)>2\text{ for }n>7)\)
\[ BIC = \frac{1}{n\hat\sigma^2}(RSS+\log(n)d\hat{\sigma}^2) \]
- Adjusted \(R^2\): unlike the other measures, we want a higher score rather than a lower one
\[ R^2_{\text{adj}} = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)}=1-\frac{(1-R^2)(n-1)}{(n-d-1)} \]
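The four indirect criteria can be computed directly from a fitted least squares model; the sketch below assumes \(RSS\), \(TSS\), \(n\), \(d\) and an estimate `sigma2_hat` of \(\sigma^2\) (e.g. from the full model) are already available, and the function name is illustrative.

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Return the indirect model-selection scores defined above."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "Adjusted R2": adj_r2}
```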
Direct methods estimate the test error directly, for example through cross-validation (CV)
- Assumes less about the data’s underlying structure
- More versatile as it doesn't rely on estimating the model's degrees of freedom or \(\sigma^2\)
- However, it is more computationally expensive, though this is not normally a problem with modern computing power
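A minimal sketch of the direct approach, assuming scikit-learn and illustrative names: estimate the test error of a candidate model by 10-fold cross-validation.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(X, y, folds=10):
    """Average held-out MSE over the folds for a least squares fit."""
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_mean_squared_error", cv=folds)
    return -scores.mean()
```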
Shrinkage/Regularization
- Shrinkage methods fit the model with all \(p\) predictors, but with constraints so that the coefficients are shrunk towards 0
- Shrinking coefficient estimates can significantly reduce their variance
- Some shrinkage methods (such as the lasso below) can also be used for variable selection
Ridge Regression
- Ridge regression attempts to decrease the variance of the least squares estimates at the expense of an increase in bias. This is done by shrinking the coefficients towards 0 and hence decreasing the flexibility; however, the coefficients never reach exactly 0.
- Therefore this approach works best when the least squares estimates have high variance
- Note that the predictors must be standardised so that their sample variance is 1, so that high-variance predictors don't dominate the minimisation problem:
\[ x_{ij}^{'}= \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum^n_{i=1}(x_{ij}-\bar{x}_j)^2}}\]
- Ridge regression coefficient estimates \(\hat{\beta}^R\) are the values that minimise the following criterion, where \(\lambda\geq0\) is a tuning parameter.
\[ \underbrace{\sum^n_{i=1}\left(y_i-\beta_0-\sum^p_{j=1}\beta_jx_{ij}\right)^2}_{RSS} +\lambda\sum^p_{j=1}\beta^2_j\]
- As \(\lambda\) increases, the coefficients get smaller and therefore the variance decreases (at the cost of increased bias)
  - As \(\lambda\rightarrow\infty\), the coefficients go to \(0\)
  - As \(\lambda\rightarrow 0\), there is effectively no penalty, so the coefficients go to their least squares solution
  - \(\lambda\) is chosen through cross-validation
- A useful measure for shrinkage plots is the \(\ell_2\) norm of the ridge regression coefficients divided by the \(\ell_2\) norm of the least squares coefficients
- This value ranges from 0 to 1
- When \(\lambda=0\) the value is 1 as \(\hat{\beta}^R = \hat\beta\), and when \(\lambda \rightarrow \infty\) the value is 0 as \(\hat{\beta}^R\rightarrow0\)
- Therefore small values indicate large shrinkage towards 0
\[ \frac{||\hat{\beta}^R_\lambda||_2}{||\hat\beta||_2}\]
- Plotted below are the coefficient estimates for differing \(\lambda\) (penalty) values:
Ridge
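A coefficient path like the one above could be produced with a sketch like the following (scikit-learn is an assumption; it calls the tuning parameter `alpha`, and the grid of \(\lambda\) values is illustrative).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

def ridge_coefficient_path(X, y, lambdas=np.logspace(-2, 4, 100)):
    """Refit ridge regression over a grid of penalty values on standardised predictors."""
    X_std = StandardScaler().fit_transform(X)  # each predictor gets sample variance 1
    coefs = [Ridge(alpha=lam).fit(X_std, y).coef_ for lam in lambdas]
    return np.array(coefs)  # one row of p coefficients per lambda value
```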
Lasso Regression
- The lasso is similar to ridge except the coefficients can reach exactly 0, therefore the lasso also performs variable selection. Note we also need to standardise the predictors for the lasso.
- The lasso therefore tends to perform better than ridge when there are many unrelated predictors (which should have coefficient 0)
- The lasso is also preferable for interpretability as it selects predictors
- The criterion to minimise is as follows:
\[ \underbrace{\sum^n_{i=1}\left(y_i-\beta_0-\sum^p_{j=1}\beta_jx_{ij}\right)^2}_{RSS} +\lambda\sum^p_{j=1}|\beta_j| \]
Lasso
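An analogous sketch for the lasso (same scikit-learn assumption and illustrative \(\lambda\) grid): as \(\lambda\) grows, individual coefficients hit exactly zero, so the non-zero set gives the selected predictors.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_coefficient_path(X, y, lambdas=np.logspace(-3, 1, 100)):
    """Refit the lasso over a grid of penalty values; non-zero entries are the selected predictors."""
    X_std = StandardScaler().fit_transform(X)
    coefs = np.array([Lasso(alpha=lam, max_iter=10_000).fit(X_std, y).coef_
                      for lam in lambdas])
    selected = [np.flatnonzero(c) for c in coefs]  # indices with non-zero coefficients
    return coefs, selected
```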
Dimension Reduction
The idea of dimension reduction is to decrease the flexibility of the model (and in turn reduce the variance) by transforming the predictors into a smaller set of linear combinations.
- Transform the variables into \(Z_1, ..., Z_M\) where \(M<p\) and each \(Z_m\) is a linear combination of the existing predictors:
\[ Z_m = \sum^p_{j=1}\phi_{jm}X_j\]
And, in the regression setting, fit the response:
\[ y_i = \theta_0 + \sum^M_{m=1}\theta_mz_{im}+\epsilon_i\]
- A well chosen set of \(\phi_{jm}\) can allow this approach to outperform least squares regression with all predictors due to the reduction in variance.
- Note that dimension reduction is not feature selection as the \(Z_m\) still encompass all predictors.
Principal Components Analysis
Principal component analysis attempts to find the directions that summarise the most variance of the predictors, and to use these directions as regression predictors.
The first principal component is the direction along which the observations vary the most.
- This has multiple interpretations:
  - The direction that would have the largest variance/range if the observations were projected onto a line in this direction
  - The line closest to all observations when plotted in the predictor space
For 2 variables, the maximisation problem to find the first principal component is:
\[ \max_{\phi_{11},\phi_{21}}\ \mathbb{V}ar\left(\phi_{11}(x_{i1}-\bar{x}_1)+\phi_{21}(x_{i2}-\bar{x}_2)\right)\quad\text{subject to}\quad \phi_{11}^2+\phi_{21}^2 = 1\]
The first principal component scores are then:
\[\therefore z_{i1} = \hat\phi_{11}(x_{i1}-\bar{x}_1)+\hat\phi_{21}(x_{i2}-\bar{x}_2) \]
- The single principal component score attempts to summarise all predictors as one number
- The interpretation is that \(z_{i1}<0\) indicates below-average values for both predictors (assuming positive loadings)
- This works well if the relationship between the variables is mostly linear
- The second principal component is the linear combination of the variables that maximises the variance subject to \(Z_2\) being uncorrelated with \(Z_1\)
  - The solution is the direction perpendicular to \(Z_1\)
  - Further components must be uncorrelated with all prior components
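A minimal sketch of this two-variable computation, assuming scikit-learn and illustrative names: it returns the estimated loadings \((\hat\phi_{11},\hat\phi_{21})\) and the scores \(z_{i1}\).

```python
import numpy as np
from sklearn.decomposition import PCA

def first_component(x1, x2):
    """Return the loadings (phi_11, phi_21) and the first principal component scores z_i1."""
    X = np.column_stack([x1, x2])
    pca = PCA(n_components=2).fit(X)   # PCA centres each column internally
    z = pca.transform(X)
    phi = pca.components_              # rows are unit-length loading vectors
    return phi[0], z[:, 0]
```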
- The following plot shows two predictors and the observations. The lines indicate the directions of the two principal components.
PCA Direction
- The following plot shows the principal component scores for each observation on the two components. As you can see, the range/variance of the first principal component is much greater than that of the second.
PCA Scores
- The first component contains the most information due to the higher variance captured. As we can see, the correlation between the data and the first principal component is much greater than with the second. The range of the first component is also much greater.
First Component
Second Component
PCR
In the PCR approach we construct the first \(M\) principal components and use these components as the predictors in a linear regression model fit by least squares.
PCR assumes that the directions in which the predictors \(X_1,...,X_p\) show the most variation are the directions most related to \(Y\)
- If this is true, PCR will lead to better results than least squares with all predictors, because fewer coefficients are estimated, decreasing the variance and the chance of overfitting.
In PCR we have to standardise the features as well.
PCR tends to do well when the first few components capture most of the variation in the predictors as well as the relationship with the response.
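PCR can be sketched as a simple pipeline (scikit-learn assumed, names illustrative): standardise, keep the first \(M\) principal components, then regress the response on the component scores, with \(M\) chosen by cross-validation.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pcr(M):
    """Standardise, keep the first M principal components, regress the response on the scores."""
    return make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())

# e.g. compare cross_val_score(pcr(M), X, y, scoring="neg_mean_squared_error", cv=10) over M
```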
Partial Least Squares
- PLS is a supervised approach to dimension reduction. It attempts to find the directions that best explain both the predictors and the response
- In PCR there is no guarantee that the directions that best explain the predictors will best explain the response
- PLS computes the first direction \(Z_1\) by setting each \(\phi_{j1}\) equal to the coefficient of the simple linear regression of \(Y\) onto the corresponding variable \(X_j\)
- The simple linear regression coefficient acts as a measure of the correlation between the variable and the response
- The higher the correlation between the response and predictor, the higher the weight
- For the second PLS direction, each variable is adjusted for \(Z_1\) by regressing it on \(Z_1\) and taking the residuals
- These residuals can be thought of as the remaining information that was not captured by the first direction.
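A minimal PLS sketch under the same scikit-learn assumption: as with PCR, the number of directions \(M\) is a tuning parameter chosen by cross-validation, but here the directions are computed using the response as well.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def pls_cv_mse(X, y, M, folds=10):
    """Cross-validated MSE for PLS with M directions (PLSRegression standardises by default)."""
    scores = cross_val_score(PLSRegression(n_components=M), X, y,
                             scoring="neg_mean_squared_error", cv=folds)
    return -scores.mean()
```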