Complete all Questions and submit final PDF under Assignments in Canvas.

The Goal

In our last lab, we worked on creating visualizations of data in R. Now, we are ready to start the process of building models based on our goals and what we find when we examine the data.

We have been talking in class about how to create metrics that will allow us to evaluate the predictive accuracy of a model, meaning assessing how well the model performs at the task of prediction. In statistical learning, we often use these metrics to help us decide which model to use for a given data problem. However, it is not always possible to directly compute these metrics. Sometimes, we have to estimate.

The process of estimating test metrics often involves cross-validation. Today, we are going to practice some of the code structures that we need to perform the validation approach, the first type of cross-validation that we will learn.

Open up an RMarkdown file, and delete everything after Line 10.

Note: In this lab, I will ask you to show me code in a variety of places. This means you do NOT have to hide all your code.

Let's start!

The Data

Today, we are going to work with a data set on the Super Bowl. We have information from all of the Super Bowl games played from 1967 to 2020. Only the 2021 game information is excluded.

Before starting this lab, make sure you have set up your RMarkdown and downloaded the data by watching the videos on Canvas.

The very first step in starting any new Markdown file is to load the data you need in a chunk in your Markdown. If you do not load the data inside of a chunk, your Markdown will not knit.
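If you imported the data by hand while following the videos, you are all set. If you would rather load it with code, here is a minimal sketch, assuming the data were saved as a CSV file called The_Big_Game_Stats.csv in the same folder as your Markdown (the file name is an assumption; match it to whatever you downloaded from Canvas):

 # Sketch only: the file name below is an assumption; use the name of the
 # file you actually downloaded from Canvas.
 The_Big_Game_Stats <- read.csv("The_Big_Game_Stats.csv")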

You will notice that the name of the data set is a little long, which can be annoying to type over and over in R, so let's change it. To do that, we store a copy of the data under the simpler name SuperBowl, and then remove (rm()) the original version of the data, using the following two lines of code:

 SuperBowl <- The_Big_Game_Stats
 rm(The_Big_Game_Stats)

Now, go ahead and open the data. For some folks, there will be a few empty rows at the end. If not, great, and you can move to the next section. If there are empty rows, here is how to fix it!

 SuperBowl <- SuperBowl[1:54,]

Exploring our Options

Our goal for today is to estimate the winning score of a Super Bowl game (Winner_Pts) based on the number of first downs the winning team had during the game (Winner_FirstDowns).

 library(ggplot2)

1. Using ggplot2, make a plot to explore the relationship between these two variables of interest. Note: From now on in this course, whenever I tell you to make a plot, that means a plot with labelled axes and a title like "Figure 1: Winning Points vs. First Downs".

We are going to consider two different choices of f: polynomial regression and least squares linear regression (LSLR).

2. Add a fitted LSLR line to your graph from Question 1. Hint: There is code on how to do this in Lab 2, and a sketch follows this list.
3. Add a fitted third order polynomial to your graph from Question 1. Hint: There is code on how to do this in Lab 2.
4. Which of the two model choices (LSLR regression or third order polynomial regression) is the more flexible model choice?
5. If we were to choose LSLR for our model, do you think it is more likely that we would under-fit the data or over-fit the data? Explain.
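In case your Lab 2 code is not handy, here is a minimal sketch of one way to build the plot for Questions 1 - 3. It assumes the SuperBowl data frame created above, and the labels are placeholders you should adjust.

 # A sketch (one possible approach, not the only one); assumes the SuperBowl
 # data frame with columns Winner_FirstDowns and Winner_Pts.
 library(ggplot2)
 ggplot(SuperBowl, aes(x = Winner_FirstDowns, y = Winner_Pts)) +
   geom_point() +                                   # scatter plot of the data
   geom_smooth(method = "lm", se = FALSE) +         # fitted LSLR line
   geom_smooth(method = "lm",
               formula = y ~ x + I(x^2) + I(x^3),
               se = FALSE, color = "red") +         # fitted third order polynomial
   labs(title = "Figure 1: Winning Points vs. First Downs",
        x = "Winning Team First Downs",
        y = "Winning Team Points")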
Intermission: Let's talk Notation

Once we have plotted our two different choices of f, it is helpful to actually write out the estimated \(\hat{f}(X)\).

Writing down the form of a model often involves parameters, and parameters in statistics are generally represented as Greek letters. How can we write Greek letters and other mathematical notation in our Markdown file so they show up when we knit?

If we want to write mathematical notation, we need to tell Markdown, "Hey, we're going to make a math symbol!" To do that, we use dollar signs. For instance, to make \(\hat{\beta}_1\), you simply put $\hat{\beta}_1$ into the white space (not a chunk) in your Markdown.

Go ahead and do that. See how the dollar signs change colors? Also note that if you hover your mouse over what you just typed, the mathematical symbol we want will appear.

If you want the symbol to appear on its own line in your Markdown, you need to put two $ signs at the beginning and end of the line (so $$). Try that now.
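For example, putting the line below into the white space of your Markdown will display \(\hat{\beta}_1\) centered on its own line when you knit:

 $$\hat{\beta}_1$$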

The same thing works for other mathematical symbols. Let's say I want to write out a LSLR regression line. The code is $\widehat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i$. You'll notice that we used \hat for \(\hat{\beta}_1\) but \widehat over \( Y_i \). Why? Because we usually replace \( Y_i \) with a longer word, so it needs a bigger (wider) hat.

A word of caution: you must make sure that the dollar signs at the beginning and end of your mathematical expression are touching the expression itself. In other words, $\hat{y}$ will knit just fine, but $\hat{y} $ will yield an error. This is important, because your document will not knit if you forget! You can put spaces inside the expression, meaning that $\hat{y} = 4$ is fine, but there can be NO spaces next to the opening and closing dollar signs.

6. Write down the form of \({f}(X)\) for both the LSLR model and the polynomial regression model. Hint: Writing out \({f}(X)\) involves symbols, not numbers.
Using the Training Data: Fitting the Models

Now that we have decided on the two models we are considering, it is time to train, i.e., fit, the models in R.

7. Write down \(\hat{f}(X)\) for the LSLR model. Hint: Writing out \(\hat{f}(X)\) involves numeric estimates for the parameters, not symbols.
8. Write down \(\hat{f}(X)\) for the polynomial regression model. Hint: When you fit a polynomial model in R, you need to use code like lm( Y ~ X + I(X^2) + I(X^3), data = ). A sketch of the fitting code follows this list.
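If you are unsure where to start, here is a minimal sketch of the fitting code, assuming the SuperBowl data frame from above; the object names mod1 and mod2 are placeholders, and coef() prints the numeric estimates you need to write out each \(\hat{f}(X)\).

 # A sketch; the object names mod1 and mod2 are arbitrary choices.
 # Model 1: LSLR
 mod1 <- lm(Winner_Pts ~ Winner_FirstDowns, data = SuperBowl)
 coef(mod1)   # numeric estimates of the LSLR parameters

 # Model 2: third order polynomial regression
 mod2 <- lm(Winner_Pts ~ Winner_FirstDowns + I(Winner_FirstDowns^2) + I(Winner_FirstDowns^3),
            data = SuperBowl)
 coef(mod2)   # numeric estimates of the polynomial parameters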
Predictive Accuracy

Now that we have our two different models, and have used some visualizations to explore what they look like, we want to start to evaluate how well these models might do at our assigned task: prediction.

When our goal is to assess prediction, we generally check to see if we have access to test data, meaning data that were not used to train our model and that we can use to make predictions and assess our model's ability to make accurate predictions. In this case, we have a very small test data set. Where, you ask? Well, the 2021 Super Bowl has just taken place. The score for the winning team was 31, and they had 26 first downs.

9. Using Model 1 (the LSLR model), make a prediction for the winning score of the 2021 Super Bowl. Show your steps (and don't use the predict function). State the prediction and the value of the residual for the 2021 Super Bowl. A sketch of one way to check your arithmetic follows this list.
10. Repeat the same steps, but for Model 2 (the polynomial regression model).
11. Based on what we have computed so far, which of the two models more accurately predicted the winning score of the 2021 Super Bowl?
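As a check on your by-hand work, one option is to plug the 2021 value directly into the estimated coefficients. This sketch assumes the mod1 and mod2 objects from the earlier sketch.

 # A sketch: plugging 26 first downs into each trained model by hand
 # (no predict()). Assumes mod1 and mod2 from the sketch above.
 b <- coef(mod1)
 predLSLR <- b[1] + b[2] * 26             # beta0-hat + beta1-hat * 26
 31 - predLSLR                            # residual = observed score - prediction

 g <- coef(mod2)
 predPoly <- g[1] + g[2] * 26 + g[3] * 26^2 + g[4] * 26^3
 31 - predPoly                            # residual for the polynomial model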
Now, this test data set is very small. It is only one row. We don't really want to assess predictive accuracy based on only one row, as it is possible that that one row is an anomaly, meaning correct data that happens to be unusual.

Also, predictive models are generally used to predict what happens before we know the score, not after. This means that there are situations when we do not have test data that we can use to assess predictive accuracy.

Answering that challenge is the goal of a powerful set of statistical learning procedures known as cross-validation techniques: how do we assess predictive accuracy when we do not have a test data set? Let's work through one such technique, called the validation method.

In the validation method, we create "test" data by stealing some of the rows from our training data. Let's try this out with our Super Bowl data.

The Validation Method

So, let's go back to the beginning. We have a data set, and we are told that our goal is prediction. Let's ignore the 2021 Super Bowl for the moment and say we do not have test data. This tells us that the validation method might be needed to estimate predictive accuracy. Before we do any model fitting, this means we need to create some test data.

When we create two data sets from one, we run into two problems (we will meet the second one later in this lab). The first is that we reduce the sample size in the data we use for model training, which means we may have a less accurate estimate of the model parameters. Because of this, we make sure that when we split the data, more data ends up in the CV training set than in the CV test set.

12. Using an 80/20 split, determine which rows of data are going to be used for the CV training data set. Print out the row numbers that you have chosen. Show the code you used, and annotate it. Note: Annotate means using a line in your chunk with a # in front to add a brief comment about what each line of your code does. For instance, # Set a random seed.

Note: We are going to be performing the code for the validation method one step at a time in this lab. At the end, I am going to ask you to do it all at once.

13. Explain why it is important to use random sampling to determine which rows in the original training data end up in the CV training set.
14. Now, actually create the CV test and CV training data sets based on the rows you selected in Question 12. State the dimensions of the two data sets. Also, show the code you used to create the data sets, and annotate it. A sketch of one way to do the split follows this list.
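Here is a minimal sketch of one way to carry out Questions 12 and 14; the seed value and the object names CVtrain and CVtest are placeholders, not required choices.

 # A sketch; the seed and object names are arbitrary choices.
 set.seed(100)                                    # set a random seed so the split is reproducible
 n <- nrow(SuperBowl)                             # rows in the original training data
 trainRows <- sample(1:n, size = floor(0.8 * n))  # randomly choose 80% of the row numbers
 trainRows                                        # print the chosen row numbers

 CVtrain <- SuperBowl[trainRows, ]                # CV training set: the chosen rows
 CVtest  <- SuperBowl[-trainRows, ]               # CV test set: everything else
 dim(CVtrain)
 dim(CVtest)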
At this point, we want to check two things. First, make sure the number of rows in your CV training and CV test data sets add up to the original number of rows in the training data. Second, open up the CV test set and make sure that the rows of data match up to the row numbers you printed out as your answer to Question 12.

Now that we have our two data sets, it is time to actually train (fit) the two models we are considering.

15. Train both a LSLR model (Model 1) and a third order polynomial model (Model 2) on the CV training data. Show the code you used to do so, and annotate it. Write out both trained models (regression lines). Hint: (1) Remember this means using the numeric estimates for the parameters. (2) Training a model does NOT mean drawing a graph. Your answer should be an equation.
16. Compare the trained models from Question 15 to the trained (fitted) models you got when you used the entire training data set in Questions 7 and 8. Are the estimates of the parameters the same? Do we expect them to be?
Okay, so we have divided our original training data, and we have used the CV training data to fit our models. Now, it is time to use our CV test data and make predictions.

17. Using the LSLR model trained on the CV training data, make a prediction for the first row in the CV test data set. Don't use predict; compute it mathematically.
18. Using the LSLR model trained on the CV training data, make predictions for all the rows of the CV test data set. (Now you can use predict!) Store the predictions as an object called predsLSLR. Show the code you used to do so, and annotate it. Show that the prediction you obtained in the previous question is the first element in the predsLSLR vector. Note: It is okay if the values are a little different due to rounding!
19. Using the polynomial regression model trained on the CV training set, make predictions on the CV test data set. Store the predictions as an object called predsPoly. Show the code you used to do so, and annotate it.
20. Using the CV test data, estimate the test RSS and test MSE for both Model 1 and Model 2. State the numeric values, and also show the code you used to compute your answer. (A sketch of one approach follows this list.)
21. Based on your results, which model has the higher predictive accuracy? Explain.
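Here is a minimal sketch of the prediction and error computations, assuming the CVtrain and CVtest data sets from the earlier sketch; the model names mod1CV and mod2CV are placeholders.

 # A sketch; assumes CVtrain and CVtest from the sketch above.
 mod1CV <- lm(Winner_Pts ~ Winner_FirstDowns, data = CVtrain)
 mod2CV <- lm(Winner_Pts ~ Winner_FirstDowns + I(Winner_FirstDowns^2) + I(Winner_FirstDowns^3),
              data = CVtrain)

 predsLSLR <- predict(mod1CV, newdata = CVtest)        # predictions for every CV test row
 predsPoly <- predict(mod2CV, newdata = CVtest)

 RSS_LSLR <- sum((CVtest$Winner_Pts - predsLSLR)^2)    # estimated test RSS, Model 1
 MSE_LSLR <- RSS_LSLR / nrow(CVtest)                   # estimated test MSE, Model 1

 RSS_Poly <- sum((CVtest$Winner_Pts - predsPoly)^2)    # estimated test RSS, Model 2
 MSE_Poly <- RSS_Poly / nrow(CVtest)                   # estimated test MSE, Model 2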
Motivating Simulation

Great, are we done? Do we choose a model and move on? Well... not quite. At this moment, several of your classmates are working on this same lab, but they may have picked a different random seed than you did. Would that matter?

Take your code from Questions 12 - 20 (all the code needed to perform the validation method) and put it all in one chunk. We usually call a chunk with multiple lines of code that work together to perform a task a script. In this case, I am asking you to build a script that can perform the validation method, all in one chunk, so you only have to press play once.

22. In the chunk you have just created, change the random seed to 497 and hit play. Does your answer to Question 21 change?

What happened???

What we have just uncovered is a problem with the validation method when we are working with a small data set. This is not a big concern when we have a very large original training set, but here, we only have 54 rows. This means that the estimates of the test RSS and test MSE we obtain from the validation method depend heavily upon which rows of the original training set made their way into the CV training set when we drew a random sample.

This means our answer to the question "Which model has higher predictive accuracy?" can change depending on the choice of random seed.

Well, that's not ideal. It means that, especially with small training data sets, the process of using the validation method to estimate the test MSE is what we call a high variance estimation procedure. In other words, the estimate can change quite a bit depending on which data rows ended up in the CV training vs. the CV test data set. Small changes in the data sets can lead to big changes in the estimates.
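If you want to see this variability for yourself, one purely optional idea is to wrap the validation script in a loop over many different seeds and look at how much the estimated test MSE moves around. The sketch below assumes the SuperBowl data frame and Model 1 from earlier; it is an illustration, not one of the graded questions.

 # Illustrative sketch: repeat the validation split under many seeds and
 # watch how much the estimated test MSE for Model 1 changes.
 mseLSLR <- rep(NA, 100)
 for (i in 1:100) {
   set.seed(i)                                         # a different seed each time
   trainRows <- sample(1:nrow(SuperBowl), size = floor(0.8 * nrow(SuperBowl)))
   CVtrain <- SuperBowl[trainRows, ]
   CVtest  <- SuperBowl[-trainRows, ]
   fit <- lm(Winner_Pts ~ Winner_FirstDowns, data = CVtrain)
   preds <- predict(fit, newdata = CVtest)
   mseLSLR[i] <- mean((CVtest$Winner_Pts - preds)^2)   # estimated test MSE for this split
 }
 summary(mseLSLR)                                      # the spread shows the high variance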

Next Steps

So, does this mean we should never use the validation method? No. With a very large original training set, the validation approach can be very effective and very efficient computationally. However, with smaller data sets, we are going to want to try something a little different.

There are two more methods that we will explore that have better estimation properties on smaller data sets: LOOCV and k-fold cross-validation. These two techniques are similar to the validation method, but use a cleverer way to create the test data. This will be the next concept we cover in class!

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF document; no other formats will be accepted. Look through the final PDF to make sure everything has knit correctly.
Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2021 February 8.
The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.
The data set used in this lab is provided on Kaggle: Bozsolik, Timo. Superbowl History 1967 - 2020. Version 3. Accessed February 4, 2021. https://www.kaggle.com/timoboz/superbowl-history-1967-2020.