We now provide an overview of how to fit a regression model to a supplied training set of observations. Before discussing the process for training a model, we need to introduce methods for scoring such models.
Suppose that the variables \(X\) and \(Y\) are related according to a population model of the form \(Y = f(X) + \varepsilon\), where \(f\) and the distribution of \(\varepsilon\) are unknown to us. We are provided with a training set of \(n\)
paired observations of the form \((x_i, y_i)\). Assume that a model \(\hat{f}\) has been proposed as an estimate for \(f\). We can use the training data to score the model \(\hat{f}\) by calculating either of two scores, called the Sum of Squared Errors (SSE) and the Mean Squared Error (MSE). For each training observation, we first compute the residual \(\hat{\varepsilon}_i = y_i - \hat{f}(x_i)\). The two scores are then calculated as follows:
\[SSE = \sum\limits_{i=1}^n \hat{\varepsilon}_i^{\,2} \hspace{4 em} MSE = \frac{1}{n}\sum\limits_{i=1}^n \hat{\varepsilon}_i^{\,2}\]
We generally prefer models with lower SSE and MSE scores. Note that since \(MSE = SSE / n\), if we have two models, Model A and Model B, with \(SSE_A < SSE_B\), then it will also be true that \(MSE_A < MSE_B\). In other words, these two metrics will always agree on the better of two models.
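As a quick illustration of the two scores (the residual values here are made up for demonstration and are not taken from this lesson's data), both can be computed in R as follows:
res <- c(1.2, -0.5, 0.3, -2.0)    # hypothetical residuals y_i - f_hat(x_i)
sse <- sum(res^2)                 # SSE: the sum of the squared residuals
mse <- sse / length(res)          # MSE = SSE / n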
The process for finding a fitted model \(\hat{f}\), given a training set of \(n\)
paired observations \((x_i, y_i)\), can be summarized as follows:
1. Determine a class of models from which \(\hat{f}\) will be selected. This collection of models is called the hypothesis space for the training process. The choice of hypothesis space might be motivated by a graphical analysis or by specific domain knowledge. In other cases, we will simply try out many different hypothesis spaces and see what works best.
2. Select the model from the hypothesis space that best fits the training data, as measured by either the Mean Squared Error (MSE) or the Sum of Squared Errors (SSE). This is our fitted model.
The process of searching through the hypothesis space to find the best model in that class of functions is typically performed by statistical software such as R.
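For example, if we chose the hypothesis space of all linear functions \(b_0 + b_1 X\), R's lm() function would carry out this search using least squares, which amounts to minimizing the training SSE/MSE. A minimal sketch, assuming the training observations are stored in vectors named x and y:
# Search the space of linear models b0 + b1*X for the minimum-SSE fit
lin_fit <- lm(y ~ x)
coef(lin_fit)    # the fitted intercept b0 and slope b1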
Assume that we are provided with a training set consisting of 20 paired observations \((x_i, y_i)\). The data is stored in vectors named x and y. We plot the data below.
plot(y ~ x, pch=21, col="black", bg="limegreen", cex=1.4,
     xlim=c(0,4), ylim=c(0,16), xlab="X", ylab="Y",
     main="Training Data")
The scatter plot seems to suggest that there is a “curved” relation between the variables \(X\) and \(Y\). Motivated by that observation, we decide to consider quadratic fits for this data. In other words, we are going to select as our hypothesis space all functions of the form: \(\hat{f}(X) = b_0 + b_1 X + b_2 X^2\). Two such models are provided below.
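Written out explicitly (these are the same two models whose predictions are computed in the R code below), the two candidates are:
\[\text{Model 1: } \hat{f}_1(X) = 6 - 2.5X + X^2 \hspace{3 em} \text{Model 2: } \hat{f}_2(X) = 4 - 1.5X + X^2\]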
The figure below shows the graphs of these two functions superimposed on the training data. Take a moment to inspect these to decide which model you think provides a better fit.
We can assess the quality of fit for the two models by calculating either SSE or MSE. We will use R to calculate SSE for both of the models.
pred_1 <- 6 - 2.5 * x + x^2    # Model 1 predictions at the training inputs
error_1 <- y - pred_1          # Model 1 residuals
sse_1 <- sum(error_1^2)        # sum of squared errors for Model 1
sse_1
[1] 49.50108
pred_2 <- 4 - 1.5 * x + x^2    # Model 2 predictions at the training inputs
error_2 <- y - pred_2          # Model 2 residuals
sse_2 <- sum(error_2^2)        # sum of squared errors for Model 2
sse_2
[1] 39.28585
Of the two models, Model 2 has the lower SSE, so it is our preferred model. However, there is no reason to assume yet that this is the best possible quadratic model. In fact, if we rely on R to search the hypothesis space of all quadratic models, we find that the best such model is given by the function:
\[\hat{f}(X) = 3.96 - X + 0.8 X^2\]
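A sketch of how R could perform this search, assuming the training data is stored in vectors named x and y: lm() carries out least-squares estimation over the quadratic hypothesis space, which is exactly the minimum-SSE search described above.
# Search the space of quadratic models b0 + b1*X + b2*X^2 for the minimum-SSE fit
quad_fit <- lm(y ~ x + I(x^2))
coef(quad_fit)    # the fitted coefficients b0, b1, and b2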
We calculate the SSE for the optimal quadratic model as follows:
pred_opt <- 3.96 - x + 0.8 * x^2    # predictions from the optimal quadratic model
error_opt <- y - pred_opt           # residuals
sse_opt <- sum(error_opt^2)         # sum of squared errors
sse_opt
[1] 33.59416
We will often want to consider more than one class of model when performing a supervised learning task. When doing so, we will end up with multiple models that have been fit to the training data. In particular, we will have one fitted model for each hypothesis class under consideration. In this section, we will discuss methods to select a model from a set of fitted models.
The figure below displays plots for three models that have been fit to the training data that we have been considering in this lesson and in the previous one. Model 1 is the best-fitting linear model. Model 2 is the quadratic model that provides the best fit for the data. Model 3 is the optimal degree 9 polynomial model.
Each of the three models considered is the best-in-class for its hypothesis space. It is important to note that in this scenario, the hypothesis spaces are nested: the space of all degree 9 polynomials contains the space of quadratic polynomials, which in turn contains the space of all linear functions. As a result, we are guaranteed that Model 3 will have the lowest training MSE and SSE of all three models, since it has the best score of ALL degree 9 polynomials (this guarantee is stated in symbols below). Does that mean that it is the best model of the three?
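In symbols, writing \(\mathcal{H}_A\) and \(\mathcal{H}_B\) for two hypothesis spaces (notation introduced here only to state the point compactly), the guarantee is simply that a larger hypothesis space can never fit the training data worse than a space it contains:
\[\mathcal{H}_A \subseteq \mathcal{H}_B \;\Longrightarrow\; \min_{\hat{f} \in \mathcal{H}_B} SSE(\hat{f}) \le \min_{\hat{f} \in \mathcal{H}_A} SSE(\hat{f}),\]
where the SSE is computed on the training set.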
A visual inspection of the plots would suggest that Model 2 provides the best fit for the data.
Model 1 is too simple. This linear model fails to capture the apparently nonlinear relationship in the training data. This is a case of underfitting, where the model being considered is too simple to capture the true nature of the relationship.
Model 3, on the other hand, seems unnecessarily complicated. The curve passes very near each of the training points, but it suggests complexities in the relationship that probably don’t actually exist. This is a classic case of overfitting, where the model being considered is too flexible and fits itself to the noise in the data. An overfit model will perform very well on the training data, but will fail to generalize well to new, unseen observations.
We train our models by selecting ones that perform well on the training data, but we are ultimately interested in finding models that will generalize well to new observations: when we put a model into production, it will be used to make predictions for new, previously unseen observations. To estimate how well our models will perform on such observations, we generally set aside some portion of the available data before training. We do not allow the models to train on this holdout data. Because the observations in the holdout set are unseen by the fitted models, they can be used to measure how well a model will generalize to new data.
In particular, we typically split our labeled data into three sets: the training set, the validation set, and the test set. The purpose of these three sets will be explained below.
The training set is used to train the model. We will provide the training set to a machine learning algorithm as input, and the algorithm will search the specified hypothesis space to find the best-in-class model, as measured by training MSE/SSE.
The validation set is used to compare different fitted models. When we are considering multiple hypothesis spaces for a supervised learning problem, we can compare the resulting fitted models using the validation set, which was not seen by any of the models during the fitting stage. We typically select the model with the lowest validation MSE/SSE as our final model.
The test set is used to assess the performance of our final model. Using the validation set to select a fitted model introduces the possibility of selecting a model that overfits the validation set. As a result, the validation MSE/SSE might not be a reliable measure of model performance. We calculate the test MSE/SSE to provide the final estimate of the selected model’s performance. Note that the test set is used only once, at the very end of the model-building process, after we have selected our final model; it does not influence any of our model-building decisions.
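As a sketch of how such a split might be created in R (the data frame name df, the use of base R indexing, and the 50/25/25 proportions are all illustrative assumptions, chosen to mirror the 20/10/10 split used in the example below):
set.seed(1)                    # for reproducibility
n <- nrow(df)                  # df is assumed to hold all of the labeled observations
idx <- sample(n)               # randomly shuffle the row indices
n_train <- floor(0.50 * n)     # 50% of the data for training
n_val <- floor(0.25 * n)       # 25% for validation; the remainder is the test set
train <- df[idx[1:n_train], ]
val   <- df[idx[(n_train + 1):(n_train + n_val)], ]
test  <- df[idx[(n_train + n_val + 1):n], ]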
Let’s return to our previous example. Assume that in addition to the 20 training observations that we have been considering (shown in green), we also have access to 10 validation observations (shown in blue), as well as 10 test observations (shown in red). A scatter plot including all three sets is shown below.
Let’s start by calculating the training MSE for each of the three models. As we have already observed, we expect that Model 3 will have the lowest training MSE.
  Model 1   Model 2   Model 3
2.7392282 1.6787222 0.9738137
We have confirmed that Model 3 does, in fact, have the lowest training MSE. However, an analysis of the plots suggests that Model 3 might be overfitting. Let’s compare the performance of the models on the validation set.
 Model 1  Model 2  Model 3
5.537570 3.689066 7.228711
We see that Model 2 has the lowest validation MSE. We will select this to be our final model. We will now calculate the test MSE for this model alone.
Test MSE: 3.815343
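For reference, here is a sketch of how this value could be computed, assuming the test observations are stored in vectors named x_test and y_test (hypothetical names) and using the rounded coefficients of the selected quadratic model reported above:
pred_test <- 3.96 - x_test + 0.8 * x_test^2    # predictions from the selected model
test_mse <- mean((y_test - pred_test)^2)       # test MSE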
To summarize, we now show all of the calculated MSE scores together.
           Model 1  Model 2   Model 3
train_mse 2.739228 1.678722 0.9738137
val_mse   5.537570 3.689066 7.2287108
test_mse        NA 3.815343        NA
Notice that as we moved from less flexible fitting methods (such as considering only linear models) to more flexible fitting methods (such as considering degree-9 polynomials), we saw two things occur:
1. The training MSE steadily decreased as the flexibility of the hypothesis space increased.
2. The validation MSE decreased at first, but then increased once the models became flexible enough to overfit the training data.
It is very typical to see this sort of behavior when working with several regression methods with varying degrees of flexibility. The following figures illustrate typical relationships between model flexibility and both training and validation MSE. In each of the examples shown, the black curve in the left plot represents a known population model, while the other curves are models that have been fit to the sample data that is shown. On the right side, the grey curve represents the training MSE as a function of flexibility, while the red curve represents validation MSE.