ID5059 Lecture 06 - Model Fit (cont.) & Model selection

C. Donovan
12 Feb 2018

Housekeeping

Academic strike: Thurs & Fri
Lab clashes

Recap

We're trying to find the “best” model
Parsimonious in parameters and in covariates
\( R^2 \) and AIC
Model validation by distinct test and training datasets
Cross validation by leave-one-out, k-fold, bootstrap, …

Classification measures

What about categorical \( y \)?

[Step 1: make everything numeric]
As a simple starting point consider a logistic problem (chalk n talk)
A useful tool for investigating the performance in this case is a confusion matrix:

\[ \begin{array}{|l|c c|}\hline & y=0\ \text{(-ve)} & y=1\ \text{(+ve)} \\\hline \hat{y}=0\ \text{(-ve)} & a & b\\ \hat{y}=1\ \text{(+ve)} & c & d\\\hline \end{array} \]

The matrix contains the counts of correct predictions of class 0 (\( a \)), correct predictions of class 1 (\( d \)), and the two ways an incorrect prediction can be made (\( b \) and \( c \)). This scales to more classes.
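As a minimal sketch of how the table is filled in, the counts can be tallied directly from paired true and predicted classes. The data below are made up for illustration, and the names `y`, `y_hat` and `table` are assumptions of this sketch. The layout follows the table above: rows are predictions, columns are the truth.

```python
# Tally the 2x2 confusion matrix in the layout used above:
# rows are predictions (y_hat), columns are the truth (y).
y     = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]   # made-up true classes
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]   # made-up predicted classes

# table[i][j] counts the cases with y_hat == i and y == j
table = [[0, 0], [0, 0]]
for truth, pred in zip(y, y_hat):
    table[pred][truth] += 1

(a, b), (c, d) = table
print("         y=0  y=1")
print(f"y_hat=0  {a:3d}  {b:3d}")
print(f"y_hat=1  {c:3d}  {d:3d}")
```

Note that some software lays the table out the other way round (truth in rows), so it is worth checking the convention before reading off \( b \) and \( c \).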

What about categorical \(y\)?

These are referred to as:
Correct positive: where \( \hat{y}=1|y=1 \)
Correct negative: where \( \hat{y}=0|y=0 \)
False positive: where \( \hat{y}=1|y=0 \)
False negative: where \( \hat{y}=0|y=1 \)
These ideas are common in everyday life (look at the back of a pregnancy-test packet) - the sensitivity and specificity of medical tests are defined from these quantities.

What about categorical \(y\)?

Accuracy: proportion of predictions that are correct \[ \frac{a+d}{a+b+c+d} \]
Overall misclassification: proportion of predictions that are incorrect \[ \frac{b+c}{a+b+c+d} \]
Sensitivity/Recall/True positive rate: proportion of actual positives that are correctly predicted \[ \frac{d}{b+d}\qquad \text{i.e.}\quad P(\hat{y}=1|y=1) \]
Specificity/True negative rate: proportion of actual negatives that are correctly predicted \[ \frac{a}{a+c}\qquad \text{i.e.}\quad P(\hat{y}=0|y=0) \]

What about categorical \(y\)?

Precision: proportion of predicted positives that are correct \[ \frac{d}{c+d}\qquad \text{i.e.}\quad P(y=1|\hat{y}=1) \]
False positive rate: proportion of actual negatives that are incorrectly predicted as positive \[ \frac{c}{a+c}\qquad \text{i.e.}\quad P(\hat{y}=1|y=0) \]
False negative rate: proportion of actual positives that are incorrectly predicted as negative \[ \frac{b}{b+d}\qquad \text{i.e.}\quad P(\hat{y}=0|y=1) \]
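The measures above are all simple ratios of the four cell counts, so they can be computed in one place. The following is a small sketch with made-up counts; the function name `classification_measures` is hypothetical, and the arguments follow the confusion-matrix cells \( a, b, c, d \).

```python
def classification_measures(a, b, c, d):
    """Measures from the 2x2 table.

    a = correct negatives, b = false negatives,
    c = false positives,   d = correct positives.
    Assumes every denominator is non-zero (both classes occur and
    at least one positive prediction is made).
    """
    n = a + b + c + d
    return {
        "accuracy": (a + d) / n,
        "misclassification": (b + c) / n,
        "sensitivity": d / (b + d),
        "specificity": a / (a + c),
        "precision": d / (c + d),
        "false_positive_rate": c / (a + c),
        "false_negative_rate": b / (b + d),
    }

m = classification_measures(a=50, b=10, c=5, d=35)   # made-up counts
for name, value in m.items():
    print(f"{name:20s} {value:.3f}")
```

Two identities make useful checks: sensitivity and false negative rate sum to 1, as do specificity and false positive rate.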

Need-to-knows:

Appreciate that all-possible-subsets selection in regression is limited in the number of covariates it can handle (remembering that interactions may be required).
Know that some efficient algorithms exist to make all possible subsets feasible for moderate numbers of covariates.
Be able to describe (e.g. schematically) how backwards, forwards and stagewise selection methods work for the ordinary-least-squares regression case.
Understand the philosophical difficulty in interpreting an automatically selected model given their behaviour under perturbation.

Ref: Section 3.3 THF (2nd Ed.). Section 3.4 for more advanced approaches (not examinable, interested parties only). McLeod & Xu (bestglm) give an overview and R examples.

Where are we?

We have the linear regression model
\( X \) matrix with rows of input values
Columns can be:
- Numeric inputs
- Transformed numeric inputs
- Basis expansions
- Predictor variable interactions
In any event, the model is linear in the parameters
The aim (for this example) is to find parameter values that minimise RSS (or some loss function)
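To make "linear in the parameters" concrete, here is a minimal sketch using NumPy on simulated data (the data, seed and coefficient values are all assumptions of this example). The design matrix mixes numeric inputs, a transformed input and an interaction, yet the fit is still ordinary least squares because each column simply gets its own coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)           # simulated data, purely illustrative
n = 200
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x1 - 1.5 * x1**2 + 0.5 * x1 * x2 + rng.normal(0, 0.5, n)

# Columns: intercept, numeric inputs, a transformed input, an interaction
X = np.column_stack([np.ones(n), x1, x2, x1**2, x1 * x2])

# Least squares: the beta that minimises RSS = sum((y - X beta)^2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = float(np.sum((y - X @ beta_hat) ** 2))
print("beta_hat:", np.round(beta_hat, 2))
print("RSS:", round(rss, 2))
```

Any other parameter vector, including the one used to simulate the data, gives an RSS at least as large as the fitted one.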

What haven't we considered?

Basically “statistics” - no notable inference here/yet.

By making assumptions about input and output distributions we can calculate:
Variance of output
Standard errors for parameters
Confidence intervals for parameters (95% CI is roughly \( \pm \) 2 SE)
Prediction limits for models
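A rough sketch of these calculations for an ordinary-least-squares fit, assuming independent errors with constant variance that are approximately normal. The data are simulated and all variable names are assumptions of this example; the formulas used are \( \hat{\sigma}^2 = RSS/(n-p) \) and \( \widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1} \).

```python
import numpy as np

rng = np.random.default_rng(2)                       # simulated data
n = 100
x = rng.uniform(0, 10, n)
y = 3.0 + 0.8 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta_hat
sigma2_hat = float(resid @ resid) / (n - p)          # estimated error variance
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))          # standard errors of beta_hat

# Approximate 95% confidence intervals: estimate +/- 2 SE
lower, upper = beta_hat - 2 * se, beta_hat + 2 * se

# Approximate 95% prediction limits for a new observation at x = 5
x0 = np.array([1.0, 5.0])
pred = float(x0 @ beta_hat)
se_pred = float(np.sqrt(sigma2_hat * (1 + x0 @ XtX_inv @ x0)))

for name, b, lo, hi in zip(["intercept", "slope"], beta_hat, lower, upper):
    print(f"{name:9s} {b:6.3f}  CI ({lo:.3f}, {hi:.3f})")
print(f"prediction at x=5: {pred:.2f} +/- {2 * se_pred:.2f}")
```

The prediction limits are wider than a confidence interval for the mean response because they include the variance of a new observation as well as the uncertainty in the parameter estimates.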

Restriction to subsets of inputs

We're going to select simpler models. Various reasons:

Setting some coefficients to zero may increase bias, but decrease variance – improving predictive accuracy (lower generalisation error)
Concentrating on the important input variables can lead to better interpretation of the model (caveat emptor)
Easier computation

Often simply referred to as parsimony - which is deemed a good thing

Occam's razor, after William of Ockham: “Entities must not be multiplied beyond necessity” (Non sunt multiplicanda entia sine necessitate) - although that wording is John Punch's (1639).

All possible subsets

Use a brute-force algorithmic approach to search through all possible models - obviously a difficult combinatorial problem. Consider:

With 40 potential covariates, considered only as entering marginally (i.e. as main effects), there are about \( 10^{12} \) potential models (\( 2^{40} \)).
Considering even only first-order interactions makes the problem vastly larger (well over \( 10^{200} \) models).
Allowing a hundredth of a second to fit each model, our sun will have expired well before the search finishes.
There are algorithms (such as leaps and bounds) which have made this problem far more tractable, but 100+ variables is still considered a big problem.
The common algorithms are restricted to basic ordinary-least-squares problems.
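The size of the search can be checked with a few lines of arithmetic. This sketch assumes every main effect and every pairwise interaction may be included or excluded freely (no hierarchy constraint), which puts the interaction count at the upper end; the variable names are assumptions of this example.

```python
import math

p = 40
main_effects_models = 2 ** p                     # each covariate is either in or out
n_interactions = math.comb(p, 2)                 # number of pairwise interactions
all_terms_models = 2 ** (p + n_interactions)     # ignoring any hierarchy constraint

seconds_per_fit = 0.01
seconds_per_year = 60 * 60 * 24 * 365.25
years_main = main_effects_models * seconds_per_fit / seconds_per_year

print(f"main effects only: {main_effects_models:.3e} models, about {years_main:.0f} years")
print(f"with interactions: a number with {len(str(all_terms_models))} digits")
```

Even the main-effects-only search takes centuries at this fitting speed; once interactions are allowed the count has hundreds of digits, which is where the lifetime of the sun becomes the relevant comparison.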

Step selection methods

Conceptually very simple, and far smaller problems than searching all subsets.
They may not (probably won't) return the “optimal” model.
We must set our objective measure - such as the adjusted-\( R^2 \) or AIC

Forwards selection

1. Begin with a very simple model e.g. \( \hat{\mathbf{y}}=\hat{\beta}_0 \)
2. Add the “best” covariate to the current model e.g. the \( x \) that explains the greatest variance on its inclusion, or decreases the AIC most, etc.
3. Repeat step 2 until a stopping rule is achieved e.g. the adjusted-\( R^2 \) no longer increases, AIC is minimised, etc.
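The forward procedure can be sketched as follows, using AIC as the objective measure. This is an illustrative implementation on simulated data, not a reference one: the helper names `aic_ols` and `forward_select` are hypothetical, and the AIC is the Gaussian form \( n\log(RSS/n) + 2k \), which drops additive constants that are the same for every model.

```python
import numpy as np

def aic_ols(X, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(number of columns)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * X.shape[1]

def forward_select(X, y):
    """Greedy forward selection over the columns of X (an intercept is always kept)."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    best_aic = aic_ols(intercept, y)              # step 1: intercept-only model
    while remaining:
        # step 2: try each excluded covariate, keep the one giving the lowest AIC
        scores = [(aic_ols(np.hstack([intercept, X[:, selected + [j]]]), y), j)
                  for j in remaining]
        aic, j = min(scores)
        if aic >= best_aic:                       # stopping rule: no further improvement
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected, best_aic

rng = np.random.default_rng(3)                    # simulated data
X = rng.normal(size=(200, 6))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=200)
selected, aic = forward_select(X, y)
print("selected columns (in order of entry):", selected)
```

Only columns 0 and 3 carry signal in this simulation, so both should be picked up; any further columns that enter do so by chance.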

Backwards selection

1. Begin with a very complex model e.g. one containing all covariates, \( \hat{\mathbf{y}}=\mathbf{X}\hat{\boldsymbol{\beta}} \)
2. Remove the least informative covariate from the current model e.g. the covariate with the largest \( p \)-value for the \( t \)-statistic of its parameter estimate, or the one whose removal decreases the AIC most, etc.
3. Repeat step 2 until a stopping rule is achieved e.g. the adjusted-\( R^2 \) no longer increases, AIC is minimised, etc.
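A matching sketch for the backward direction, again on simulated data and with hypothetical helper names (`aic_ols`, `design`, `backward_select`). It uses the AIC criterion rather than \( p \)-values, and the same Gaussian AIC up to an additive constant.

```python
import numpy as np

def aic_ols(X, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(number of columns)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * X.shape[1]

def design(X, cols):
    """Intercept plus the chosen columns of X (cols may be empty)."""
    return np.hstack([np.ones((X.shape[0], 1)), X[:, np.array(cols, dtype=int)]])

def backward_select(X, y):
    """Greedy backward elimination over the columns of X (the intercept is always kept)."""
    selected = list(range(X.shape[1]))
    best_aic = aic_ols(design(X, selected), y)    # step 1: model with all covariates
    while selected:
        # step 2: try dropping each covariate, keep the removal giving the lowest AIC
        scores = [(aic_ols(design(X, [k for k in selected if k != j]), y), j)
                  for j in selected]
        aic, j = min(scores)
        if aic >= best_aic:                       # stopping rule: no removal improves AIC
            break
        best_aic = aic
        selected.remove(j)
    return selected, best_aic

rng = np.random.default_rng(3)                    # simulated data
X = rng.normal(size=(200, 6))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=200)
selected, aic = backward_select(X, y)
print("retained columns:", selected)
```

Running forward and backward searches on the same data and comparing the retained columns is a quick way to see whether the two directions agree.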

Stagewise selection

1. Begin with a very complex model.
2. Remove the least informative covariate from the current model.
3. (After the first iteration) Determine whether one of the excluded variables could be profitably exchanged with a variable in the model (e.g. giving a net decrease in the AIC) - swap if this is true.
4. Repeat from step 2 until a stopping rule is achieved e.g. the adjusted-\( R^2 \) no longer increases, AIC is minimised, etc.

The same rationale can be applied but from a very basic starting model, increasing in complexity.

Further Considerations

There is still ongoing work on efficient methods to select models, even within this simple regression context - see Efron et al. 2004.
Don't read it in detail - however, the discussion from p452 onwards demonstrates the still-contentious nature of this subject.
The brief coverage here represents methods established 30 years ago.

Don't expect your automated model selection to be clever:

The model is likely to be sub-optimal even for our selected measure of “best”.
Different selection methods can (very likely) give contradictory final models.
What thresholds should be employed in the selection process? (includes in and out criteria for stagewise).
There is clearly the possibility of truly irrelevant variables being included just by chance.
Where covariates are correlated, proxies may be selected.
Interpretation of model components must be tentative.