In our last lab, we worked on creating visualizations of data in R. Now, we are ready to start the process of building models based on our goals and what we find when we examine the data.
We have been talking in class about how to create metrics that will allow us to evaluate the predictive accuracy of a model, meaning assessing how well the model performs at the task of prediction. In statistical learning, we often use these metrics to help us decide which model to use for a given data problem. However, it is not always possible to directly compute these metrics. Sometimes, we have to estimate.
The process of estimating test metrics often involves cross-validation. Today, we are going to practice some of the code structures that we need to perform the validation approach, the first type of cross-validation that we will learn.
Open up an RMarkdown file, and delete everything after Line 10.
Note: In this lab, I will ask you to show me code in a variety of places. This means you do NOT have to hide all your code.
Let's start!
Today, we are going to work with a data set on the Super Bowl. We have information from all of the Super Bowl games played from 1967 to 2020. Only the 2021 game information is excluded.
Before starting this lab, make sure you have set up your RMarkdown and downloaded the data by watching the videos on Canvas.
The very first step in starting any new Markdown file is to load in the data that you need in a chunk in your Markdown. If you do not load the data inside of a chunk, your Markdown will not Knit.
You will notice that the name of the data set is a little long, which can be annoying to type over and over in R, so let's change it. To do that, we store a copy of the data under the simpler name SuperBowl, and then remove (rm()) the original version of the data, using the following two lines of code:
SuperBowl <- The_Big_Game_Stats
rm(The_Big_Game_Stats)
Now, go ahead and open the data. For some folks, there will be a few empty rows at the end. If not, great, and you can move to the next section. If there are empty rows, here is how to fix it!
SuperBowl <- SuperBowl[1:54,]
Our goal for today is to estimate the winning score of a Super Bowl game (Winner_Pts) based on the number of first downs the winning team had during the game (Winner_FirstDowns).
library(ggplot2)
We are going to consider two different choices of f: polynomial regression and least squares linear regression (LSLR).
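One way you might picture the two choices side by side is a scatterplot with both fitted curves layered on top. The sketch below assumes a cubic polynomial (the same degree as the template later in this lab), and the colors and axis labels are my own choices; adapt it to whatever you are asked to plot.

library(ggplot2)
ggplot(SuperBowl, aes(x = Winner_FirstDowns, y = Winner_Pts)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +        # LSLR fit
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), se = FALSE, color = "red") + # cubic polynomial fit
  labs(x = "First downs (winning team)", y = "Points (winning team)")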
Once we have plotted our two different choices of f, it is helpful to actually write out the estimated \(\hat{f}(X)\).
Writing down the form of a model usually involves parameters, which in statistics are generally represented by Greek letters. How can we write Greek letters and other mathematical notation in our Markdown file so they show up when we knit?
If you want to write mathematical notation, we need to tell Markdown, "Hey, we're going to make a math symbol!" To do that, you use dollar signs. For instance, to make \(\hat{\beta}_1\), you simply put $\hat{\beta}_1$ into the white space (not a chunk) in your Markdown.
Go ahead and do that. See how the dollar signs change colors? Also note that if you hover your mouse over what you just pasted, the mathematical symbol we want will appear.
If you want the symbol to appear on its own line in your Markdown, you need to put two $ signs at the beginning and end of the line (so $$). Try that now.
The same thing works for other mathematical symbols. Let's say I want to write out an LSLR regression line. The code is $\widehat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i$. You'll notice that we used \hat for \(\hat{\beta}_1\) but \widehat over \( Y_i \). Why? Because we usually replace \( Y_i \) with a longer name, so it needs a bigger (wider) hat.
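As one example of putting these pieces together, you might type out the cubic polynomial model on its own line like this (just a sketch; use whatever form your model actually takes):

$$\widehat{Y_i} = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\beta}_2 X_i^2 + \hat{\beta}_3 X_i^3$$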
A word of caution: you must make sure that the dollar signs at the beginning and end of your mathematical expression are touching the math itself, with no space just inside either dollar sign. In other words, $\hat{y}$ will knit just fine, but $\hat{y} $ will yield an error. This is important; your document will not knit if you forget! You can still put spaces inside the expression, meaning that $\hat{y} = 4$ is fine, but there can be NO space right after the opening $ or right before the closing $.
Now that we have decided on the two models we are considering, it is time to train, i.e., fit, the models in R.
lm( Y ~ X + I(X^2) + I(X^3), data = )
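For example, filling in the template for our data might look like the following (a sketch: the model names modelLSLR and modelPoly are my own, and the cubic degree simply mirrors the template above):

# LSLR model: winning points as a linear function of first downs
modelLSLR <- lm(Winner_Pts ~ Winner_FirstDowns, data = SuperBowl)

# Polynomial model: adds squared and cubed terms for first downs
modelPoly <- lm(Winner_Pts ~ Winner_FirstDowns + I(Winner_FirstDowns^2) + I(Winner_FirstDowns^3), data = SuperBowl)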
Now that we have our two different models, and have used some visualizations to explore what they look like, we want to start to evaluate how well these models might do at our assigned task: prediction.
When our goal is prediction, we generally check to see if we have access to test data, meaning data that were not used to train our model, which we can use to make predictions and assess the model's ability to make accurate predictions. In this case, we have a very small test data set. Where, you ask? Well, the 2021 Super Bowl has just taken place. The score for the winning team was 31, and they had 26 first downs.
Make a prediction for the winning score of the 2021 Super Bowl using each of your models (the predict function will be helpful here). State the prediction and the value of the residual for the 2021 Super Bowl.
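A minimal sketch of what this could look like for the LSLR model, assuming the modelLSLR object from earlier (the names newGame, predLSLR2021, and residLSLR2021 are my own):

# Build a one-row data frame holding the 2021 game's first downs
newGame <- data.frame(Winner_FirstDowns = 26)

# Predict the winning score and compute the residual (observed minus predicted)
predLSLR2021 <- predict(modelLSLR, newdata = newGame)
residLSLR2021 <- 31 - predLSLR2021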
Now, this test data set is very small. It is only one row. We don't really want to assess predictive accuracy based on only one row, as it is possible that that one row is an anomaly, meaning correct data that happens to be unusual.
Also, predictive models are generally used to predict what happens before we know the score, not after. This means that there are situations when we do not have test data that we can use to assess predictive accuracy.
How do we assess predictive accuracy when we do not have a test data set? This is the goal of a powerful set of statistical learning procedures known as cross-validation techniques. Let's work through one such technique, called the validation method.
In the validation method, we create "test" data by stealing some of the rows from our training data. Let's try this out with our Super Bowl data.
So, let's go back to the beginning. We have a data set, and we are told that our goal is prediction. Let's ignore the 2021 Super Bowl for the moment and say we do not have test data. This tells us that performing the validation method might be needed to estimate predictive accuracy. Before we do any model fitting, this means we need to create some test data.
When we create two data sets from one, we run into two problems. The first is that we reduce the sample size in the data we use for model training, which means we may get less accurate estimates of the model parameters. Because of this, when we split the data, we make sure that more data ends up in the CV training set than in the CV test set. (We will run into the second problem for ourselves near the end of the lab.)
Note: We are going to be performing the code for the validation method one step at a time in this lab. At the end, I am going to ask you to do it all at once.
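As a sketch of one way to create the split (the seed, the 80/20 proportion, and the names trainRows, CVtrain, and CVtest are my own choices; use whatever proportion and seed the questions ask for):

set.seed(100)                                      # makes the random split reproducible
trainRows <- sample(1:nrow(SuperBowl), size = floor(0.8 * nrow(SuperBowl)))
CVtrain <- SuperBowl[trainRows, ]                  # rows used to fit the models
CVtest  <- SuperBowl[-trainRows, ]                 # rows held out as "test" data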
At this point, we want to check two things. First, make sure the number of rows in your CV training and CV test data set add up to the original number of rows in the training data. Second, open up the test set and make sure that the rows of data match up to the row numbers you printed out as your answer to Question 12.
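One way to run these checks, assuming the object names from the sketch above:

nrow(CVtrain) + nrow(CVtest) == nrow(SuperBowl)    # should be TRUE
sort(setdiff(1:nrow(SuperBowl), trainRows))        # row numbers that ended up in the CV test set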
Now that we have our two data sets, it is time to actually train (fit) the two models we are considering.
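A sketch of the fitting step, using the CV training data (the model names are again my own):

# Fit both candidate models on the CV training data only
modCVLSLR <- lm(Winner_Pts ~ Winner_FirstDowns, data = CVtrain)
modCVPoly <- lm(Winner_Pts ~ Winner_FirstDowns + I(Winner_FirstDowns^2) + I(Winner_FirstDowns^3), data = CVtrain)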
Okay, so we have divided our original training data, and we have used the CV training data to fit our models. Now, it is time to use our CV test data and make predictions.
For the LSLR model, find the prediction for the first row in the CV test data without using predict; compute it mathematically. Then, make predictions for every row in the CV test data (this time, you can use predict!). Store the predictions as an object called predsLSLR. Show the code you used to do so, and annotate it. Show that the prediction you obtained in the previous question is the first element in the predsLSLR vector. Note: It is okay if the values are a little different due to rounding!

Now do the same for the polynomial model, storing the predictions as an object called predsPoly. Show the code you used to do so, and annotate it.
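A sketch of what the prediction step might look like, assuming the CV model names used above:

predsLSLR <- predict(modCVLSLR, newdata = CVtest)  # LSLR predictions for every CV test row
predsPoly <- predict(modCVPoly, newdata = CVtest)  # polynomial predictions for every CV test row
predsLSLR[1]                                       # should match the hand computation, up to rounding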
Great, are we done? Do we choose a model and move on? Well...not quite. At this moment, several of your classmates are working on this same lab, but they may have picked a different random seed from you. Would that matter?
Take your code from Questions 12-20 (all the code needed to perform the validation method) and put it all in one chunk. We usually call a chunk in which multiple lines of code work together to perform a task a script. In this case, I am asking you to build a script that performs the validation method, all in one chunk, so you only have to press play once.
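As a rough outline of what such a script might contain, assembled from the sketches above (your own answers to Questions 12-20 are what belong here; the names, seed, and split proportion remain my assumptions):

set.seed(100)                                      # try a different seed and re-run the whole chunk

trainRows <- sample(1:nrow(SuperBowl), size = floor(0.8 * nrow(SuperBowl)))
CVtrain <- SuperBowl[trainRows, ]
CVtest  <- SuperBowl[-trainRows, ]

modCVLSLR <- lm(Winner_Pts ~ Winner_FirstDowns, data = CVtrain)
modCVPoly <- lm(Winner_Pts ~ Winner_FirstDowns + I(Winner_FirstDowns^2) + I(Winner_FirstDowns^3), data = CVtrain)

predsLSLR <- predict(modCVLSLR, newdata = CVtest)
predsPoly <- predict(modCVPoly, newdata = CVtest)

mean((CVtest$Winner_Pts - predsLSLR)^2)            # estimated test MSE, LSLR
mean((CVtest$Winner_Pts - predsPoly)^2)            # estimated test MSE, polynomial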
What happened???
What we have just uncovered is a problem of the validation method when we are working with a small data set. This is not a big concern when we have a very large original training set, but here, we only have 54 rows. This means that the estimates of the test RSS and test MSE we obtain from the validation method are entirely dependent upon which rows in the original training set made their way into the CV training set when you drew a random sample.
This means our answer to the question "Which model has higher predictive accuracy?" can change depending on the choice of random seed.
Well, that's not ideal. This means that, especially with small training data sets, the process of using the validation method to estimate the test MSE is what we call a high variance estimation procedure. In other words, the estimate can change quite a bit depending on which data rows ended up in the CV training vs. the CV test data set. Small changes in the data sets can lead to big changes in the estimates.
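If you want to see this variability for yourself, one option is to repeat the split over a handful of seeds and watch the estimate move around. Here is a sketch for the LSLR model (the seeds and the 80/20 split are arbitrary choices of mine):

for (s in 1:5) {
  set.seed(s)
  rows  <- sample(1:nrow(SuperBowl), size = floor(0.8 * nrow(SuperBowl)))
  fit   <- lm(Winner_Pts ~ Winner_FirstDowns, data = SuperBowl[rows, ])
  preds <- predict(fit, newdata = SuperBowl[-rows, ])
  print(mean((SuperBowl$Winner_Pts[-rows] - preds)^2))   # estimated test MSE for this seed
}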
So does this mean it is never okay to use the validation method? No. With a very large original training set, the validation approach can be very effective and computationally very efficient. However, with smaller data sets, we are going to want to try something a little different.
There are two more methods that we will explore that will have better estimation properties on smaller data sets: LOOCV and k-fold. These two techniques are similar to validation, but require a clever way to create test data. This will be the next concept we cover in class!