- We will explore ways to improve the linear model
\[ Y = \beta_0 + \beta_1x_1 + \beta_2x_2+\dots\]
- There are many shortcomings of linear regression:
  - Prediction accuracy is not always the best, especially when \(p\) is large
  - Model interpretability is hard when \(p\) is large, therefore we may prefer a smaller set of important variables
  - Non-linearities: the linear assumption is almost always an approximation, and it can be a very bad approximation
- Note that these methods can also be extended to other machine learning methods
Subset Selection
In all the following methods, we consider the candidate models at each fixed number of predictors and pick the best at that size using a metric that is insensitive to the number of predictors, such as \(RSS\) (or deviance for a GLM). Then at the end we compare the best model of each size using scores that are sensitive to the number of predictors, such as \(AIC\) or cross-validated error, to determine the best model overall.
Best subset selection considers all possible models given the predictors
- Picks the best of all possible models given a criterion, but is very computationally expensive:
\[ \sum^p_{k=0}\left(\begin{matrix}p\\k\end{matrix}\right) = 2^p\text{ models}\]
The algorithm is as follows:
Best Subset Algorithm
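As a concrete illustration, here is a minimal Python sketch of best subset selection (the notes do not specify software; scikit-learn and the names `best_subset`, `X`, `y` are illustrative assumptions): for each size \(k\) it fits every \(\binom{p}{k}\) model and keeps the one with the lowest \(RSS\).

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each size k, fit every C(p, k) model and keep the one with the lowest RSS."""
    n, p = X.shape
    best_per_size = {}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(range(p), k):
            cols = list(cols)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - fit.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best_per_size[k] = (best_cols, best_rss)
    return best_per_size  # compare across sizes with AIC/BIC/CV, not RSS
```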
- Forward selection starts from the null model \(\mathcal{M}_0\) and builds on it one step at a time, picking the best model at each number of predictors
- Only looks forward one step at a time, therefore it may include variables that aren't needed later and/or end with a combination that is not totally optimal
- Can be used on datasets where \(p>n\), fitting submodels with up to \(n-1\) predictors
- This is a guided search, and the models considered are generally good fits, so the computational cost is much lower:
\[ 1+\sum^{p-1}_{k=0}(p-k) = 1+\frac{p(p+1)}{2}\text{ models}\]
- The algorithm is as follows:
Forward Stepwise Algorithm
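The greedy search can be sketched as follows (again a minimal illustration with assumed scikit-learn tooling and illustrative names): at each step, add the predictor that most reduces the \(RSS\).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedily add, at each step, the predictor that most reduces the RSS."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        rss = {}
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss[j] = np.sum((y - fit.predict(X[:, cols])) ** 2)
        best_j = min(rss, key=rss.get)
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), rss[best_j]))
    return path  # one candidate model per size; choose among them with AIC/BIC/CV
```

Backward stepwise selection is the mirror image: start from all predictors and drop the one whose removal increases the \(RSS\) the least.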
- Backward selection is the opposite of forward selection: we start from the full model \(\mathcal{M}_p\) and remove one variable at a time.
- Only looks backward one step at a time, so it has the same limitations as forward selection and considers the same number of models
- However, unlike forward selection, backward selection cannot be used when \(p>n\)
Backward Stepwise Algorithm
- Hybrid selection is a combination of forward and backward selection where variables are added one by one like with forward selection, but variables that no longer provide an improvement to model fit may also be removed.
- Attempts to closely mimic best subset selection while retaining the computational advantages of forward and backward selection.
Determining the Best Model
We need a metric to compare models with different numbers of predictors
- The model with more predictors will always have a lower \(RSS\)/higher \(R^2\)
- We want to get an estimate of the test error, done either through direct (CV) or indirect (mathematical adjustment) methods
- Note these formulas are for linear regression but can be generalised for other models.
Indirect formulas, noting \(d\) is the number of predictors in the model:
- \(C_p\) (Mallows' \(C_p\)) is an unbiased estimate of the test MSE if \(\hat{\sigma}^2\) is an unbiased estimate of \(\sigma^2\)
\[ C_p=\frac{1}{n}(RSS+2d\hat{\sigma}^2) \]
- Akaike Information Criterion (AIC): this is proportional to \(C_p\) for least squares, so it gives the same results
\[ AIC = \frac{1}{n\hat\sigma^2}(RSS+2d\hat{\sigma}^2) \]
- Bayesian Information Criterion (BIC): this applies a higher penalty when there are many predictors \((\log(n)>2\text{ for }n>7)\)
\[ BIC = \frac{1}{n\hat\sigma^2}(RSS+\log(n)d\hat{\sigma}^2) \]
- Adjusted \(R^2\): unlike the other measures, we want a higher score rather than a lower one
\[ R^2_{\text{adj}} = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)}=1-\frac{(1-R^2)(n-1)}{(n-d-1)} \]
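The four indirect criteria can be computed directly from a fitted least squares model; the sketch below assumes \(RSS\), \(TSS\), \(n\), \(d\) and an estimate `sigma2_hat` of \(\sigma^2\) (e.g. from the full model) are already available, and the function name is illustrative.

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Return the indirect model-selection scores defined above."""
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "Adjusted R2": adj_r2}
```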
Direct methods estimate the test error directly, for example through cross-validation (CV)
- Assumes less about the data’s underlying structure
- More versatile as it doesn't rely on estimating the model's degrees of freedom or \(\sigma^2\)
- However, it is more computationally expensive, though this is not normally a problem with modern computing power
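A minimal sketch of the direct approach, assuming scikit-learn and illustrative names: estimate the test error of a candidate model by 10-fold cross-validation.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(X, y, folds=10):
    """Average held-out MSE over the folds for a least squares fit."""
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_mean_squared_error", cv=folds)
    return -scores.mean()
```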
Shrinkage/Regularization
- Shrinkage methods fit the model with all \(p\) predictors, but with constraints so that the coefficients are shrunk towards 0
- Shrinking coefficient estimates can significantly reduce their variance
- Some shrinkage methods (such as the lasso below) can also be used for variable selection
Ridge Regression
- Ridge regression attempts to decrease the variance of the least squares estimates at the expense of an increase in bias. This is done by shrinking the coefficients towards 0 and hence decreasing the flexibility; however, the coefficients never reach exactly 0.
- Therefore this approach works best when the least squares estimates have high variance
- Note that the predictors must be standardised so that their sample variance is 1, so that high-variance predictors don't dominate the minimisation problem:
\[ x_{ij}^{'}= \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum^n_{i=1}(x_{ij}-\bar{x}_j)^2}}\]
- Ridge regression coefficient estimates \(\hat{\beta}^R\) are the values that minimise the following criterion, where \(\lambda\geq0\) is a tuning parameter.
\[ \underbrace{\sum^n_{i=1}\left(y_i-\beta_0-\sum^p_{j=1}\beta_jx_{ij}\right)^2}_{RSS} +\lambda\sum^p_{j=1}\beta^2_j\]
- As \(\lambda\) increases, the coefficients get smaller and therefore the variance decreases (at the cost of increased bias)
  - As \(\lambda\rightarrow\infty\), the coefficients go to \(0\)
  - As \(\lambda\rightarrow 0\), there is effectively no penalty, so the coefficients go to their least squares solution
  - \(\lambda\) is chosen through cross-validation
- A useful measure for shrinkage plots is the \(\ell_2\) norm of the ridge regression coefficients divided by the \(\ell_2\) norm of the least squares coefficients
- This value ranges from 0 to 1
- When \(\lambda=0\) the value is 1 as \(\hat{\beta}^R = \hat\beta\), and when \(\lambda \rightarrow \infty\) the value is 0 as \(\hat{\beta}^R\rightarrow0\)
- Therefore small values indicate large shrinkage towards 0
\[ \frac{||\hat{\beta}^R_\lambda||_2}{||\hat\beta||_2}\]
- Plotted below are the coefficient estimates for differing \(\lambda\) (penalty) values:
Ridge
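A coefficient path like the one above could be produced with a sketch like the following (scikit-learn is an assumption; it calls the tuning parameter `alpha`, and the grid of \(\lambda\) values is illustrative).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

def ridge_coefficient_path(X, y, lambdas=np.logspace(-2, 4, 100)):
    """Refit ridge regression over a grid of penalty values on standardised predictors."""
    X_std = StandardScaler().fit_transform(X)  # each predictor gets sample variance 1
    coefs = [Ridge(alpha=lam).fit(X_std, y).coef_ for lam in lambdas]
    return np.array(coefs)  # one row of p coefficients per lambda value
```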
Lasso Regression
- The lasso is similar to ridge except the coefficients can reach exactly 0, therefore the lasso also performs variable selection. Note we also need to standardise the predictors for the lasso.
- The lasso therefore tends to perform better than ridge when there are many unrelated predictors (which should have coefficient 0)
- The lasso is also preferable for interpretability as it selects predictors
- The criterion to minimise is as follows:
\[ \underbrace{\sum^n_{i=1}\left(y_i-\beta_0-\sum^p_{j=1}\beta_jx_{ij}\right)^2}_{RSS} +\lambda\sum^p_{j=1}|\beta_j| \]
Lasso
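An analogous sketch for the lasso (same scikit-learn assumption and illustrative \(\lambda\) grid): as \(\lambda\) grows, individual coefficients hit exactly zero, so the non-zero set gives the selected predictors.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_coefficient_path(X, y, lambdas=np.logspace(-3, 1, 100)):
    """Refit the lasso over a grid of penalty values; non-zero entries are the selected predictors."""
    X_std = StandardScaler().fit_transform(X)
    coefs = np.array([Lasso(alpha=lam, max_iter=10_000).fit(X_std, y).coef_
                      for lam in lambdas])
    selected = [np.flatnonzero(c) for c in coefs]  # indices with non-zero coefficients
    return coefs, selected
```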
Dimension Reduction
The idea of dimension reduction is to decrease the flexibility of the model (and in turn reduce the variance) by transforming the predictors into a smaller set of linear combinations.
- Transform the variables into \(Z_1, ..., Z_M\) where \(M<p\) and each \(Z_m\) is a linear combination of the existing predictors:
\[ Z_m = \sum^p_{j=1}\phi_{jm}X_j\]
And, in the regression setting, fit the response:
\[ y_i = \theta_0 + \sum^M_{m=1}\theta_mz_{im}+\epsilon_i\]
- A well chosen set of \(\phi_{jm}\) can allow this approach to outperform least squares regression with all predictors due to the reduction in variance.
- Note that dimension reduction is not feature selection as the \(Z_m\) still encompass all predictors.
Principal Components Analysis
Principal component analysis attempts to find the directions that summarise the most variance of the predictors, and to use these directions as regression predictors.
The first principal component is the direction along which the observations vary the most.
- This has multiple interpretations:
  - The direction that would have the largest variance/range if the observations were projected onto a line in this direction
  - The line closest to all observations when plotted in the predictor space
For 2 variables, the maximisation problem to find the first principal component is:
\[ \max_{\phi_{11},\phi_{21}}\ \mathbb{V}ar\left(\phi_{11}(x_{i1}-\bar{x}_1)+\phi_{21}(x_{i2}-\bar{x}_2)\right)\quad\text{subject to}\quad \phi_{11}^2+\phi_{21}^2 = 1\]
The first principal component scores are then:
\[\therefore z_{i1} = \hat\phi_{11}(x_{i1}-\bar{x}_1)+\hat\phi_{21}(x_{i2}-\bar{x}_2) \]
- The single principal component score attempts to summarise all predictors as one number
- The interpretation is that \(z_{i1}<0\) indicates below-average values for both predictors (assuming positive loadings)
- This works well if the relationship between the variables is mostly linear
- The second principal component is the linear combination of the variables that maximises the variance subject to \(Z_2\) being uncorrelated with \(Z_1\)
  - The solution is the direction perpendicular to \(Z_1\)
  - Further components must be uncorrelated with all prior components
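A minimal sketch of this two-variable computation, assuming scikit-learn and illustrative names: it returns the estimated loadings \((\hat\phi_{11},\hat\phi_{21})\) and the scores \(z_{i1}\).

```python
import numpy as np
from sklearn.decomposition import PCA

def first_component(x1, x2):
    """Return the loadings (phi_11, phi_21) and the first principal component scores z_i1."""
    X = np.column_stack([x1, x2])
    pca = PCA(n_components=2).fit(X)   # PCA centres each column internally
    z = pca.transform(X)
    phi = pca.components_              # rows are unit-length loading vectors
    return phi[0], z[:, 0]
```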
- The following plot shows two predictors and the observations. The lines indicate the directions of the two principal components.
PCA Direction
- The following plot shows the principal component scores for each observation on the two components. As you can see, the range/variance of the first principal component is much greater than that of the second.
PCA Scores
- The first component contains the most information due to the higher variance captured. As we can see, the correlation between the data and the first principal component is much greater than with the second. The range of the first component is also much greater.
First Component
Second Component
PCR
In the PCR approach we construct the first \(M\) principal components and use these components as the predictors in a linear regression model fit by least squares.
PCR assumes that the directions in which the predictors \(X_1,...,X_p\) show the most variation are the directions most related to \(Y\)
- If this is true, PCR will lead to better results than least squares with all predictors, because fewer coefficients are estimated, decreasing the variance and the chance of overfitting.
In PCR we have to standardise the features as well.
PCR tends to do well when the first few components capture most of the variation in the predictors as well as the relationship with the response.
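PCR can be sketched as a simple pipeline (scikit-learn assumed, names illustrative): standardise, keep the first \(M\) principal components, then regress the response on the component scores, with \(M\) chosen by cross-validation.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pcr(M):
    """Standardise, keep the first M principal components, regress the response on the scores."""
    return make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())

# e.g. compare cross_val_score(pcr(M), X, y, scoring="neg_mean_squared_error", cv=10) over M
```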
Partial Least Squares
- PLS is a supervised approach to dimension reduction. It attempts to find the directions that best explain both the predictors and the response
- In PCR there is no guarantee that the directions that best explain the predictors will best explain the response
- PLS computes the first direction \(Z_1\) by setting each \(\phi_{j1}\) equal to the coefficient of the simple linear regression of \(Y\) onto the corresponding variable \(X_j\)
- The simple linear regression coefficient acts as a measure of the correlation between the variable and the response
- The higher the correlation between the response and predictor, the higher the weight
- For the second PLS direction, each variable is adjusted for \(Z_1\) by regressing it on \(Z_1\) and taking the residuals
- These residuals can be thought of as the remaining information that was not captured by the first direction.
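A minimal PLS sketch under the same scikit-learn assumption: as with PCR, the number of directions \(M\) is a tuning parameter chosen by cross-validation, but here the directions are computed using the response as well.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def pls_cv_mse(X, y, M, folds=10):
    """Cross-validated MSE for PLS with M directions (PLSRegression standardises by default)."""
    scores = cross_val_score(PLSRegression(n_components=M), X, y,
                             scoring="neg_mean_squared_error", cv=folds)
    return -scores.mean()
```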