ID5059 Lecture 4 (and 5) - how good is the model fit?

C. Donovan
9 Feb 2018

Lecture 04 - housekeeping

  • Labs start next week at 1pm – 3pm on odd Fridays (weeks 3, 5, 7, 9 (11)) in the John Honey MSc lab. If you can't make this time, speak to me.
  • You should all have access, logins, and one-time passwords.
  • Project 1 will appear on Moodle today.
  • Project groups will be formed after week 3.

Big picture

\[ \mathbf{y} = f(\mathbf{X}) + \mathbf{e} \]

  • We want to predict y using a set of x, but don't know the function f that connects them
  • We've seen examples of building complex things/functions from simple pieces/bases
  • If we have a sensible measure of agreement between the model predictions and observed y, then we can optimise the parameters of f (details later, but we've seen OLS)
  • We can make lots of candidate f, but which one is best?

Cool stuff we can already do

  • We can make almost arbitrarily complex functions of x and fit these to y, if we're happy using RSS as the measure of `good'

A tricky problem

Cool stuff we want to do

  • We want to automatically choose between lots of candidate models and find the best one
  • This requires another, different consideration of what is `good'.
  • This is HUGELY IMPORTANT: naivety will not serve us well.

Recap

  • Use basis functions to fit data via the linear regression equation
  • Nonlinear fits are obtained from linear combinations of parameters and (nonlinear) basis functions of the x-values
  • Linear (and even constant) basis functions are often fine
  • More parameters means more cost and more chance of overfitting (later)
  • Not obvious how best to choose knot positions
  • Find parameters by minimising some loss function (usually squared error) – see the sketch below
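As a concrete illustration of this recap, here is a minimal Python sketch of fitting a polynomial basis by ordinary least squares; the simulated data, choice of basis, and variable names are mine, not from the lecture.

```python
# A minimal sketch: simulate y = f(x) + e, build a polynomial basis, and
# estimate the parameters by minimising the RSS (ordinary least squares).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # signal + noise

X = np.vander(x, N=4, increasing=True)        # basis functions: 1, x, x^2, x^3
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares parameter estimates
yhat = X @ beta                               # fitted values
print("RSS:", np.sum((y - yhat) ** 2))
```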

Need-to-knows: Recap

Conceptually:

  • polynomial regression as a linear combination of basis functions.
  • how curves may be created as a linear combination of basis functions.
  • how to construct a univariate piecewise constant basis e.g. a bin smooth.
  • how to construct a multidimensional constant basis (graphically and via tensors).
  • B-spline bases.
  • tensor product B-spline basis functions.

Training error, Generalization error & Overfitting

Key points

  • Training error: how much our model fails to predict the data used to develop it.
  • Generalisation error: how much our model fails to predict data not used in model development i.e. real error.
  • Overfitting: fitting a model that is `too complicated'.

Training error, Generalization error & Overfitting

Overfitting - which can variously mean (these are different angles of the same thing) the model is:

  • more complicated than the underlying signal we are trying to capture.
  • capturing some noise and deeming this to be signal.
  • too closely approximating our sample, but not generalising so well to other samples.

From our perspective, let's consider it as not being optimal in terms of generalisation error because the model is too complex.

  • It follows there is underfitting: fitting a model that is `too simple' - not being optimal in terms of generalisation error by being too simple.

Some measures of model fit

  • We do not know the true model for the systematic component of the system under study - this is the rule not the exception.
  • We seek a parsimonious model - given the point above this is not really for interpretative purposes, rather to improve generality.
  • We must fish through our covariates to find the `best' model.
  • This may be a large problem - automated approaches are necessary
  • Automated approaches require objective measures

\(R^2\) - the ubiquitous R-square

  • Can be expressed in many different ways giving different interpretations
  • Two are given here:

\[ R^2=1-\frac{(n-1)^{-1}\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-1)^{-1}\sum_{i=1}^n(y_i-\bar{y})^2} = 1-\frac{\mathrm{SSE}/(n-1)}{\mathrm{SST}/(n-1)} \]

  • SSE is the sum of squared errors for your model, the SST is the total sum of squares for your data.
  • This formulation shows the interpretation as “the proportion of variance explained by the model” (see the sketch below)
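A minimal sketch of this formulation, assuming `y` and `yhat` are arrays of observed and fitted values (e.g. from the least-squares sketch earlier); the function name is mine.

```python
# R^2 as 1 - SSE/SST: the proportion of variance explained by the model.
import numpy as np

def r_squared(y, yhat):
    sse = np.sum((y - yhat) ** 2)        # sum of squared errors for the model
    sst = np.sum((y - np.mean(y)) ** 2)  # total sum of squares for the data
    return 1 - sse / sst
```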

The vanilla-flavoured \(R^2\)

  • An alternative formulation (which justifies the name \( R^2 \)) is
    \[ R^2=[Corr(\hat{\mathbf{y}}, \mathbf{y})]^2 \]
  • where \( \mathbf{y} \) and \( \hat{\mathbf{y}} \) are the vectors of observed and predicted response values.
  • So it is the square of the correlation between the observed response values and those predicted under the model

The vanilla-flavoured \(R^2\)

  • Bounded by 0 and 1,
  • \( R^2=1 \) is a perfect agreement between the model predictions and the observed data
  • The value 1 would indicate that all of the variability in the response is accounted for by the model
  • A value of zero indicates the worst possible fit: there is no (linear) relationship between the observed data and the model predictions.
  • These formulations are only truly equivalent under the conditions in which they are usually applied, i.e. a model fitted using ordinary least squares (with an intercept) - a quick numerical check is sketched below.
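A quick numerical check of this equivalence, assuming the `y`, `yhat`, and `r_squared` objects from the earlier sketches (an OLS fit with an intercept):

```python
# Squared correlation between fitted and observed values equals 1 - SSE/SST
# for an OLS fit with an intercept.
import numpy as np

corr_sq = np.corrcoef(yhat, y)[0, 1] ** 2
print(np.isclose(corr_sq, r_squared(y, yhat)))  # expect True
```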

Why the \(R^2\) can suck

  • Usually only reflects training error, not generalisation error
  • Offers no protection therefore against over-fitting
  • Don't use it to choose between models of different complexity

Can we fix it a bit?…

A penalized \(R^2\)

\[ \mbox{adjusted-}R^2 = 1-\frac{(n-p-1)^{-1}\sum_{i=1}^n(y_i-\hat{y}_i)^2}{(n-1)^{-1}\sum_{i=1}^n(y_i-\bar{y})^2} = 1-\frac{\mathrm{SSE}/(n-p-1)}{\mathrm{SST}/(n-1)} \]

  • SSE and SST as before; \( p \) is the number of predictors in the model
  • so, again the proportion of variance explained by the model, but with the degrees of freedom (\( df \)) included
  • it does not necessarily increase with the complexity of the model
  • the penalty balances two aspects: fidelity to the data and parsimony
  • (sort of) bounded by 0 and 1, with large values indicating a relatively good fit (see the sketch below)
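A minimal sketch of the adjusted \(R^2\), assuming `y` and `yhat` as before and `p` non-intercept parameters (so the residual degrees of freedom are \(n-p-1\)); the function name is mine.

```python
# Adjusted R^2: penalises model complexity via the degrees of freedom.
import numpy as np

def adjusted_r_squared(y, yhat, p):
    n = len(y)
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))
```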

Akaike's Information Criterion (AIC)

A penalized Likelihood

  • The most widespread measure for model selection within statistics
  • Theory is extensive, but its form is simple:

\[ AIC=-2\ell + 2p \qquad \big(= n\log_e(\mathrm{RSS}/n) + 2p \ \mbox{ for Gaussian errors, up to an additive constant}\big) \]

  • where \( \ell \) is the log-likelihood, and \( p \) is the number of parameters in the model.
  • so what is likelihood? I'm glad you asked…

Likelihood in a nutshell/5 minutes

  • A measure of how likely the observed data are, given a model of signal and noise.
  • The noise model is a formal probability density function.
  • Better models are ones that make the observed data more likely if they were true (i.e. more likely to be generating what we see).
  • Linear regression example (`chalk-n-talk'); a sketch of the Gaussian log-likelihood follows.
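A minimal sketch of the Gaussian log-likelihood for a regression model, assuming `y` and `yhat` as in the earlier sketches; the maximum-likelihood noise variance is plugged in, and the function name is mine.

```python
# Log-likelihood of the data under a Gaussian noise model, with the ML
# estimate of the noise variance (RSS/n) plugged in.
import numpy as np

def gaussian_loglik(y, yhat):
    n = len(y)
    sigma2 = np.sum((y - yhat) ** 2) / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
```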

Akaike's Information Criterion (AIC)

Penalized Likelihood

\[ AIC=-2\ell + 2p \qquad \big(= n\log_e(\mathrm{RSS}/n) + 2p \ \mbox{ for Gaussian errors, up to an additive constant}\big) \]

  • We seek the model that minimises this value (i.e. head towards \( -\infty \))
  • It can be seen that complex models are penalised by the inclusion of \( p \).
  • Similar to the adjusted-\( R^2 \), there is a measure of fidelity of the model to the data and a penalty term
  • The value of the AIC has no particular interpretation in itself - use it relatively (see the sketch below).
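A minimal sketch, using the `gaussian_loglik` helper above or the RSS directly; the Gaussian shortcut drops additive constants, which is fine because only relative AIC values matter.

```python
# AIC as a penalised likelihood; smaller is better, and only relative
# differences between candidate models are meaningful.
import numpy as np

def aic(loglik, p):
    return -2 * loglik + 2 * p

def aic_from_rss(rss, n, p):
    # Gaussian special case, additive constants dropped.
    return n * np.log(rss / n) + 2 * p
```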

Model Validation

Model fit against unseen data – two approaches:

  1. Hold back some of the data
    • Data is genuinely unseen at the model derivation stage
    • Works poorly for small datasets (a simple hold-out sketch follows this list)
  2. Use all the data in a managed way
    • Data is unseen at chosen iterations
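A minimal sketch of approach 1 (holding back some data), reusing `X` and `y` from the first sketch; the 80/20 split is just an illustration.

```python
# Hold-out validation: fit on a random 80% of the data, estimate
# generalisation error on the unseen 20%.
import numpy as np

rng = np.random.default_rng(2)
idx = rng.permutation(len(y))
n_train = int(0.8 * len(y))
train, test = idx[:n_train], idx[n_train:]

beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
test_mse = np.mean((y[test] - X[test] @ beta) ** 2)
print("hold-out MSE:", test_mse)
```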

Cross-Validation (CV)

Model fit against `new' data

  • Very important - this is a very intuitive measure which is used extensively within data-mining/ML/predictive modelling.
  • A reasonable definition of a good model - a model that best predicts data that is as-yet unseen
  • Put another way, a model that best predicts data that was not used in the construction of the model in question

Model fit against `new' data

  • Suppose a parameter or parameters (\( \theta \), say) govern our model of the signal \( f \).
  • A criterion for the `best' value of \( \theta \) is that it minimises the error in predicting unseen data.
  • The omitted values represent unobserved data - CV indicates the model's generality in predicting future observations
  • smaller is better

\(k\)-fold cross validation

  • Leave-one-out CV (omitting each observation in turn) has been criticised for not perturbing the data enough
  • A simple variant called \( k \)-fold CV was proposed - the amount of data excluded at each iteration is greater than one observation
  • The data is `folded' by the number specified, e.g. a 5-fold CV would entail (see the sketch after this list):
  1. folding/dividing the data into 5 roughly equal portions,
  2. fitting the model 5 times, omitting 20% of the data for each iteration of model fitting.
  3. the subsequent results are summarised.
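A minimal sketch of \(k\)-fold CV for the least-squares fit used in the earlier sketches; the function name and the random assignment to folds are mine.

```python
# k-fold CV: fit the model k times, each time omitting one fold, and
# average the held-out mean squared errors.
import numpy as np

def k_fold_cv(X, y, k=5, seed=3):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return np.mean(errors)
```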

Out-Of-Bag (OOB)

  • Bootstrapping revolves around resampling \( n \) elements from a pool of \( n \) elements with replacement
  • It is almost certain that a proportion of the data is not selected (about 37% on average, since each observation is omitted with probability \( (1-1/n)^n \approx e^{-1} \))
  • The non-sampled fraction of the data can be used as validation data
  • This Out-Of-Bag (OOB) sample can be used to test our model performance (see the sketch below)
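A minimal sketch of the OOB idea for one bootstrap resample, reusing `X` and `y` from the earlier sketches; in practice this would be repeated over many resamples to get a set of error estimates.

```python
# One bootstrap resample: fit on the resampled rows, assess on the rows
# that were never drawn (the out-of-bag sample).
import numpy as np

rng = np.random.default_rng(4)
n = len(y)
boot = rng.integers(0, n, size=n)        # sample n row indices with replacement
oob = np.setdiff1d(np.arange(n), boot)   # indices never sampled (~37% of rows)

beta, *_ = np.linalg.lstsq(X[boot], y[boot], rcond=None)
oob_mse = np.mean((y[oob] - X[oob] @ beta) ** 2)
print("OOB MSE:", oob_mse)
```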

Some observations

  • We usually estimate our parameters on the basis of training error (maximum likelihood, RSS, misclassification error).
  • `Complexity' (which may be one or more parameters) needs to be assessed on the basis of generalisation error.
  • Penalised fit measures attempt this based on the fit to the sample. Problem: how many parameters do we really have?
  • Cross-validation/validation measures assess this by `simulating' new data.
  • Generalisation error is our focus; we do not know the appropriate complexity a priori.

Summary

  • We have penalised fit measures that guard somewhat against over-fitting
  • We have test/validation data methods that effectively simulate new data
  • A single validation dataset is easy and efficient, but only one estimate of generalisation/test error
  • k-fold validation gives k estimates of generalisation error via random folds of the data
  • OOB uses bootstrapping, with the randomly non-sampled parts giving generalisation error (so as many estimates as you have bootstrap resamples)
  • The last two require lots of computation, but give uncertainty on error

[Obviously?] All of these assume our data are similar to what future data will be like (representative of future signal/noise)