Assessing Model Fit

Jake

30/05/2021

Introduction

Statistical Modelling

  • The goal of statistical learning is to link a response variable \(Y\) to \(p\) different predictors \(X = (X_1, \dots, X_p)\), where \(f\) is a fixed but unknown function and \(\epsilon\) is a random error term with mean 0. We estimate \(f\) using one of many statistical learning approaches. \[ Y = f(X) + \epsilon \]
  • We estimate this function for either prediction or inference
    • Prediction is where we are not concerned with the relationship between variables; we only want an accurate prediction for a set of inputs \[\hat{Y} = \hat{f}(X)\]
      • The accuracy of a prediction depends on both the reducible and the irreducible error. Only the former is under our control, so we aim to minimise the reducible error as much as possible.
      • When we are aiming to get a prediction, the interpretability of our model isn’t as important.
    • Inference is where we want to understand the relationship between different independent variables \(X\) and the response variable \(Y\)
      • As we are interested in the relationship between variables, we prefer interpretable models for inference
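The split between reducible and irreducible error can be checked with a small simulation. This is only a sketch under stated assumptions: the "true" function `f`, the noise level `SIGMA`, and the deliberately crude estimate \(\hat{f}(x) = 0\) are all made up for illustration and do not come from the notes.

```python
import math
import random

random.seed(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return math.sin(2 * math.pi * x)

SIGMA = 0.5  # assumed standard deviation of the error term epsilon

# Simulate Y = f(X) + epsilon with X uniform on [0, 1]
xs = [random.random() for _ in range(20000)]
ys = [f(x) + random.gauss(0, SIGMA) for x in xs]

def mse(predict):
    """Average squared difference between Y and predict(X)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

mse_true_f = mse(f)             # even the true f leaves the irreducible error
mse_crude = mse(lambda x: 0.0)  # a poor estimate adds reducible error on top

print(f"Var(epsilon)        = {SIGMA ** 2:.3f}")
print(f"MSE using true f    = {mse_true_f:.3f}")
print(f"MSE using f_hat = 0 = {mse_crude:.3f}")
```

Predicting with the true `f` should leave an error close to \(Var(\epsilon)\), while the crude estimate should be clearly worse; the gap between the two is the reducible part.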

Estimation

  • We can consider both linear and non-linear approaches for estimation of \(f\). The majority of these methods can be characterised as parametric or non-parametric. Consider the true underlying model for a set of points:

    True Model

    • Parametric models make an assumption about the functional form/shape of \(f\) and then fit the model by estimating its parameters from the training data. An example of this is the linear regression model.
      • While this reduces the problem to estimating a set of parameters, the model we choose will usually not match the true unknown form of \(f\). We can attempt to fix this by choosing a more flexible model with more parameters, but this can lead to overfitting. Consider the following linear model: \[f(X) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \]
        Linear Model

    • Non-parametric models do not make explicit assumptions about the functional form of \(f\) and therefore have the potential to accurately fit a wide range of possible shapes for \(f\).
      • However, as this approach does not reduce the problem to estimating a small number of parameters, a relatively large number of observations is required to obtain an accurate estimate of \(f\)
      • Note also that higher flexibility is not always the better option, as it can lead to overfitting. Consider the following highly flexible spline model, which is much closer to the real shape:
        Spline Model
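The contrast between the two approaches can be sketched in a few lines. The code below is an illustration under assumptions, not the models behind the figures: the non-linear `true_f`, the noise level, and the use of a K-nearest-neighbours average as the non-parametric method (instead of a spline) are all choices made for this example.

```python
import math
import random

random.seed(1)

def true_f(x):
    # Hypothetical non-linear "true" function (an assumption for this sketch)
    return math.sin(2 * math.pi * x)

n = 200
xs = [random.random() for _ in range(n)]
ys = [true_f(x) + random.gauss(0, 0.2) for x in xs]

# Parametric: assume f(x) = b0 + b1 * x and estimate two parameters by least squares
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

def linear_fit(x0):
    return b0 + b1 * x0

# Non-parametric: no assumed form; average the responses of the k nearest points
def knn_fit(x0, k=10):
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

# Compare both estimates with the true function on a grid of points
grid = [i / 20 for i in range(1, 20)]
err_linear = sum((true_f(g) - linear_fit(g)) ** 2 for g in grid) / len(grid)
err_knn = sum((true_f(g) - knn_fit(g)) ** 2 for g in grid) / len(grid)

print(f"linear fit: b0 = {b0:.2f}, b1 = {b1:.2f}, error vs true f = {err_linear:.3f}")
print(f"KNN fit (k = 10): error vs true f = {err_knn:.3f}")
```

The linear model is summarised by just `b0` and `b1` but cannot follow a curved `true_f`, whereas the neighbour average has no fixed form and relies on having enough observations near each point.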

Supervised vs Unsupervised Learning

  • Most statistical learning problems fall into supervised or unsupervised categories
    • For supervised learning each observation \(X_i\) has an associated response measurement \(y_i\). In these situations we want to accurately predict the response for future observations.
    • For unsupervised learning each observation \(X_i\) does not have an associated response measurement. This is used to find relationships between variables or between observations in a general sense.
      • For example cluster analysis, a popular form of unsupervised learning, aims to divide the observations into distinct groups. Consider the following graphic. This data could be grouped into clusters, which allows an output label to be assigned to each cluster. Once these outputs are assigned and a relationship is determined, it turns into a supervised learning problem.
        Clusters
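As a rough illustration of clustering unlabelled data, the sketch below uses k-means in one dimension. The notes do not name a specific clustering algorithm, so k-means is an assumption here, as are the simulated data and the choice of the smallest and largest values as starting centres.

```python
import random

random.seed(2)

# Unlabelled observations drawn around two hypothetical centres (0 and 5)
data = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(5, 1) for _ in range(50)]

def assign(points, centres):
    """Label each point with the index of its closest centre."""
    return [min(range(len(centres)), key=lambda j: abs(p - centres[j])) for p in points]

def kmeans_1d(points, centres, iterations=20):
    for _ in range(iterations):
        labels = assign(points, centres)
        new_centres = []
        for j, c in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move the centre to the mean of its members (keep it if it has none)
            new_centres.append(sum(members) / len(members) if members else c)
        centres = new_centres
    return centres, assign(points, centres)

centres, labels = kmeans_1d(data, centres=[min(data), max(data)])
print("estimated centres:", [round(c, 2) for c in centres])
print("cluster sizes:", labels.count(0), labels.count(1))
```

The resulting `labels` play the role of the assigned outputs: once each observation carries a cluster label, the labels can be treated as a response in a supervised problem.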

Regression v Classification

  • Variables can be characterised as either quantitative or qualitative, and the type of the response determines the approach to be taken.
    • Quantitative/continuous/numerical variables take on numeric values and are modelled through regression.
    • Qualitative/categorical variables take on one of \(K\) different classes/categories and are modelled through classification.
  • Note that some statistical methods, such as KNN and random forests, can be used with either quantitative or qualitative responses.
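To see how one method can serve both settings, here is a minimal KNN sketch on a tiny hand-made data set (all values are hypothetical). The same neighbours are found in both cases; only the way their responses are combined changes: an average for a quantitative response, a majority vote for a qualitative one.

```python
from collections import Counter

# Tiny hand-made training set (hypothetical values)
x_train = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y_numeric = [1.1, 1.9, 3.2, 6.1, 6.8, 8.3]                 # quantitative response
y_class = ["low", "low", "low", "high", "high", "high"]    # qualitative response

def nearest_indices(x0, k):
    """Indices of the k training points closest to x0."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    return order[:k]

def knn_regress(x0, k=3):
    # Regression: average the neighbours' numeric responses
    idx = nearest_indices(x0, k)
    return sum(y_numeric[i] for i in idx) / k

def knn_classify(x0, k=3):
    # Classification: take the most common class among the neighbours
    idx = nearest_indices(x0, k)
    return Counter(y_class[i] for i in idx).most_common(1)[0][0]

print(knn_regress(2.5))
print(knn_classify(2.5))   # low
print(knn_classify(6.5))   # high
```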

Assessing Model Fit

MSE

  • In order to determine which model fits a dataset the best, we need to assess the model fit. In a regression setting the most commonly-used measure is the mean squared error (MSE): \[ MSE = \frac{1}{n}\sum^n_{i=1}(y_i-\hat{f}(x_i))^2\]
  • The most appropriate use of the MSE is on the test set, to see how a model performs on new data points. We want to choose the model that has the lowest test MSE.
    • We can always decrease the training MSE by increasing the flexibility of a model, but this may increase the test MSE. Overfitting refers to the case where a less flexible model would have yielded a smaller test MSE.
    • The following plot shows how different models with different levels of flexibility impact the resulting test and training MSE:
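As a numerical companion to this section, the sketch below computes training and test MSE for fits of increasing flexibility. It is an illustration under assumptions: the data are simulated from a made-up `true_f`, and flexibility is controlled through the number of neighbours `k` in a KNN regression (smaller `k` means a more flexible fit), which is not necessarily the family of models used for the plot.

```python
import math
import random

random.seed(3)

def true_f(x):
    # Hypothetical "true" function for the simulation
    return math.sin(2 * math.pi * x)

def simulate(n):
    xs = [random.random() for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys

x_train, y_train = simulate(100)
x_test, y_test = simulate(1000)

def knn_predict(x0, k):
    # Average response of the k training points closest to x0
    nearest = sorted(zip(x_train, y_train), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(xs, ys, k):
    return sum((y - knn_predict(x, k)) ** 2 for x, y in zip(xs, ys)) / len(xs)

results = {}
for k in (50, 10, 1):  # from least flexible to most flexible
    results[k] = (mse(x_train, y_train, k), mse(x_test, y_test, k))
    print(f"k = {k:2d}: training MSE = {results[k][0]:.3f}, test MSE = {results[k][1]:.3f}")
```

With `k = 1` every training point is its own nearest neighbour, so the training MSE drops to zero while the test MSE does not, which is the overfitting pattern described above.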

Bias-Variance Tradeoff

  • Two properties of a statistical learning model compete within the expected test MSE: the variance and the bias of the model.

\[ \mathbb{E}[(y_0-\hat{f}(x_0))^2] = \underbrace{Var(\hat{f}(x_0))+[Bias(\hat{f}(x_0))]^2}_{\text{reducible}} + \underbrace{Var(\epsilon)}_{\text{irreducible}}\]

  • The variance refers to the amount by which \(\hat{f}\) would change if it were estimated using a different training set

    • Generally, more flexible models have higher variance, as their fit depends heavily on the particular training set.
  • The bias refers to the error introduced by approximating a complicated real-world function with a simpler model

    • Generally, more flexible models have lower bias, as they make fewer assumptions about the form of \(f\) and can follow the true function more closely
  • The following figure shows how test MSE, variance and bias can change over different levels of flexibility:
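The decomposition can be approximated by simulation: draw many training sets, refit the model each time, and look at how \(\hat{f}(x_0)\) varies at a fixed point. The sketch below does this for a KNN regression. Everything in it is an assumption made for illustration: the function `true_f`, the noise level `SIGMA`, the test point `X0`, and the two values of `k`.

```python
import math
import random
import statistics

random.seed(4)

SIGMA = 0.3   # assumed standard deviation of epsilon
X0 = 0.25     # assumed test point x_0

def true_f(x):
    return math.sin(2 * math.pi * x)

def fit_and_predict(k, n=50):
    """Draw a fresh training set and return the KNN estimate f_hat(x_0)."""
    xs = [random.random() for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, SIGMA) for x in xs]
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - X0))[:k]
    return sum(y for _, y in nearest) / k

def decompose(k, repeats=500):
    preds = [fit_and_predict(k) for _ in range(repeats)]
    variance = statistics.pvariance(preds)                       # Var(f_hat(x_0))
    bias_sq = (statistics.fmean(preds) - true_f(X0)) ** 2        # Bias(f_hat(x_0))^2
    reducible = statistics.fmean([(p - true_f(X0)) ** 2 for p in preds])
    return variance, bias_sq, reducible

var_flex, bias_flex, red_flex = decompose(k=1)      # very flexible
var_rigid, bias_rigid, red_rigid = decompose(k=25)  # much less flexible

for name, v, b, r in (("k = 1", var_flex, bias_flex, red_flex),
                      ("k = 25", var_rigid, bias_rigid, red_rigid)):
    print(f"{name}: variance = {v:.4f}, bias^2 = {b:.4f}, "
          f"reducible = {r:.4f}, expected test MSE = {r + SIGMA ** 2:.4f}")
```

In each case the reducible error equals variance plus squared bias, and adding `SIGMA ** 2` gives the expected test MSE at `X0`. The flexible fit should show high variance and low bias, the rigid fit the reverse.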

Classification Setting

  • The bias-variance tradeoff also transfers to the classification setting.

  • The most common approach to quantifying the accuracy of our estimate \(\hat{f}\) is the error rate, the proportion of observations that are misclassified. As with the MSE, we want to minimise the error rate on the test set.

    \[ \frac{1}{n}\sum^n_{i=1}\mathbb{I}(y_i\neq\hat{y}_i) \]

    • The test error rate is theoretically minimised by a simple classifier that assigns each observation to its most likely class given its predictor values. This simple classifier is called the Bayes classifier
      • Note that in practice the Bayes classifier cannot be computed, as that would require us to know the true conditional distribution of \(Y\) given \(X\).
      • The Bayes classifier assigns a test observation with predictor values \(x_0\) to the class \(j\) for which the following probability is largest:

    \[ P(Y=j | X=x_0)\]

  • The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate, which plays the role of the irreducible error in a classification problem. The overall Bayes error rate is given by the following equation, where the expectation averages the largest class probability over all possible values of \(X\).

    \[ 1 - \mathbb{E}\left(\max_j P(Y=j|X)\right)\]

  • Many classification methods try to estimate the conditional distribution of \(Y\) given \(X\), then classify a new observation to the class with the highest estimated probability.

    • An example of this is the KNN classifier, which takes the \(K\) nearest neighbours of a new data point \(x_0\) (the set \(\mathcal{N}_0\)), estimates each class probability as the fraction of those neighbours in that class, and assigns the point to the class held by the majority of these \(K\) neighbours.

    \[ P(Y=j|X=x_0) = \frac{1}{K}\sum_{i\in\mathcal{N}_0}\mathbb{I}(y_i=j)\]

    • Approaches such as this can lead to close approximations of the Bayes Classifier. In the following graph the KNN boundary is in black and the Bayes classification boundary is in purple.
    Bayes v KNN, K=10

  • For the KNN classifier, higher flexibility corresponds to a smaller number of neighbours \(K\)

    • As a classification method becomes more flexible, the training error rate decreases, while the test error rate follows a U-shape similar to that of the test MSE, as seen in the following plot:
    Test v Training Error Rate
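The ideas in this section can be tied together in one small simulation. This is a sketch under assumptions, not the data behind the figures: the true conditional probability `p_class1` is invented so that the Bayes classifier and the Bayes error rate can be computed exactly, and a one-dimensional KNN classifier is then compared against them for three values of `k`.

```python
import random

random.seed(5)

def p_class1(x):
    # Hypothetical true conditional probability P(Y = 1 | X = x)
    return 0.9 if x > 0.5 else 0.2

def bayes_classify(x):
    # Assign the class with the largest true conditional probability
    return 1 if p_class1(x) > 0.5 else 0

# X is uniform on [0, 1], so each half of the interval has probability 0.5.
# Bayes error rate = 1 - E[max_j P(Y = j | X)]
bayes_error = 1 - (0.5 * max(0.2, 0.8) + 0.5 * max(0.9, 0.1))

def simulate(n):
    xs = [random.random() for _ in range(n)]
    ys = [1 if random.random() < p_class1(x) else 0 for x in xs]
    return xs, ys

x_train, y_train = simulate(500)
x_test, y_test = simulate(2000)

def knn_classify(x0, k):
    nearest = sorted(zip(x_train, y_train), key=lambda pair: abs(pair[0] - x0))[:k]
    votes = sum(y for _, y in nearest)    # neighbours belonging to class 1
    return 1 if votes / k > 0.5 else 0    # estimated P(Y = 1 | X = x0) > 0.5

def error_rate(xs, ys, classify):
    # Proportion of observations whose predicted class differs from the true class
    return sum(classify(x) != y for x, y in zip(xs, ys)) / len(xs)

print(f"Bayes error rate (exact)      = {bayes_error:.3f}")
print(f"Bayes classifier on test data = {error_rate(x_test, y_test, bayes_classify):.3f}")

errors = {}
for k in (1, 25, 499):  # odd k avoids ties; smaller k = more flexible
    errors[k] = (error_rate(x_train, y_train, lambda x: knn_classify(x, k)),
                 error_rate(x_test, y_test, lambda x: knn_classify(x, k)))
    print(f"k = {k:3d}: training error = {errors[k][0]:.3f}, test error = {errors[k][1]:.3f}")
```

The most flexible fit (`k = 1`) has zero training error but a test error well above the Bayes error rate, the least flexible fit (`k = 499`) performs poorly on both, and a moderate `k` should land closest to the Bayes error rate.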