STA 363 Lab 3
Complete all Questions and submit your final html or PDF under Assignments in Canvas.
The Goal
We have started talking about prediction tasks, and one of the things we have discussed is how important it is to have test data to use to assess the predictive accuracy of our model. However, we are often in situations where we are given only one data set, rather than both a test data set and a training data set. What do we do then??
In our lab today, we will explore two different techniques for how we might approach this problem.
The Data
We are going to return to our penguin data set from Lab 1. It’s easy to load, and it has some nice linear relationships we can use today.
To load the data, put the following two lines of code inside a chunk and press play:
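# Load the palmerpenguins package (install it first if needed)
library(palmerpenguins)
# Attach the penguins data set
data(penguins)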
As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data. We will begin by removing the rows with missing data, which results in a data set with \(n=333\) penguins.
Yes, I know, we know a lot of techniques for handling missing data now!! However, that is not the focus for today so we are going to proceed with CCA.
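One way to do this complete case analysis in R is with na.omit(), which keeps only the rows with no missing values:
# Keep only the complete cases
penguins <- na.omit(penguins)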
Our goal for today is to predict \(Y=\) the body mass of a penguin in grams using all other columns in the data set as features. This means we have a predictive regression task.
As we discussed in class, when we have a predictive task we need two data sets: a training data set and a testing data set. However, today we only have one data set. Because of this, we are going to explore two different ways to create validation data so we can assess the predictive accuracy of our model.
Option 1: Splitting the Data
Since we only have one data set, but we need two, one option we may consider is breaking our one data set into two. To do this, we are going to take 80% of the rows at random from the penguins data set and treat those rows as training data. The remaining 20% of the rows will be treated as validation data.
Question 1
Why is it important to make the training set bigger than the validation set?
Question 2
Using an 80/20 split, how many rows will be treated as training data?
Note: If you have a value that includes decimals, we use the floor function to get a whole number. This means that we round down to the nearest whole number.
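For example, floor() rounds down to the nearest whole number:
floor(4.7)  # returns 4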
Okay, so we know how many rows that we need to treat as training data. How do we randomly sample that many rows using R?
Using Random Seeds
To randomly sample \(n_{train}\) rows from our original \(n\) rows, where \(n_{train} = \lfloor(n \times .8)\rfloor\), we use the following:
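# Fix the random seed so the same rows are chosen every time
set.seed(100)
# Randomly sample 80% of the row numbers 1 through 333, without replacement
trainRows <- sample(1:333, 333*.8, replace = FALSE)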
The code chunk above has two lines. The first line of code, set.seed(100), is one we will use quite a bit in this course. We are going to use random sampling to choose the rows for our training data. This means that we are going to ask the computer to randomly sample 266 of our 333 rows.

However, suppose you close your Markdown file and come back to it later. We want the computer to choose the SAME 266 rows when you run your code again. Otherwise, you would get completely different results each time you drew a random sample! The set.seed() function is what ensures that each time you run your chunk, you will get the same random sample. Let’s try that.
Question 3
Create a code chunk. Use the sample command to randomly sample 2 numbers between 1 and 10 (without replacement). Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now?
You should notice that every time you hit play on this chunk, you get a different sample. This is what would happen if you closed your Markdown and re-opened it, or if you gave your code to someone else to run. This is not something we want to have happen, so we set a random seed to fix this problem.
Question 4
Now, add the line set.seed(435) to the beginning of your code chunk from the previous question (meaning this line needs to come before the sample command). You will note that I used 435, but you can use literally any positive integer you want as your random seed. Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now?
Now we notice that no matter how often we run the chunk, we get the same values. Yes!! This means that we can close our R and come back to it later, and our results will not change. This also means that we can send our code to someone else, and they will get the same random sample that we did. This means that setting a seed can help make your code reproducible.
Setting a random seed (which is what we will call using the set.seed() command) will prove very useful for any kind of random sampling we do in this course.
Sampling in R
Now, let’s go back to the code chunk we started with:
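set.seed(100)
trainRows <- sample(1:333, 333*.8, replace = FALSE)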
We have discussed the first line, but what does the second line do? Well, sample(1:333, 333*.8, replace = FALSE) will draw 266 random numbers between 1 and 333 (1:333). The form of the code is

sample( What options am I sampling from?,
        How many values do I want to sample?,
        Do I want to sample with replacement? )
Question 5
Show the code you would need to sample 5 random numbers between 5 and 678 without replacement.
Once we understand what a piece of code is doing, the next step is to understand the type of output the function produces. Once we have drawn our 266 random numbers, we store that output in a vector called trainRows.
Question 6
Set a random seed of 245. Show the code you would need to sample 5 random numbers between 5 and 200 without replacement, and store the results as trainPractice. Which 5 random numbers did you select?
Okay, so right now trainRows is a vector that contains 266 numbers between 1 and 333. We don’t want a list of numbers; we want a training data set. This means that the next step is to go into the penguins data set and grab the rows specified in trainRows.
Question 7
Show and run the code you would need to create a data set called newTrain which contains the rows in the penguins data set specified by trainRows.
At this point, we have created our new training data. This means that 266 of the original 333 rows have been used. What do we do with the 67 rows that were not chosen for training data? The rows that were not selected for our new training data set will form our validation data set.
Question 8
Show and run the code you would need to create a data set called validation which contains all the rows in the penguins data set which are NOT specified by trainRows.
Before moving on, check to make sure newTrain has 266 rows and validation has 67 rows.
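One quick way to check is with nrow():
nrow(newTrain)     # should be 266
nrow(validation)   # should be 67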
Now that we have our validation data and training data, we have created a few objects we no longer need. To clean up our workspace in R, we can remove these objects if we wish. To do so, we use the rm function; for example, to remove the vector of sampled row numbers:
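# Remove an object we no longer need
rm(trainRows)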
Using the New Data
Now that we have successfully split our data into a validation set and a new training set, let’s use them!
Question 9
Using the new training data, train a linear regression model for \(Y\) = body mass using all other columns as features. Use this trained model to make predictions for body mass for all the penguins in the validation data. State the validation RMSE of your model.
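To compute the validation RMSE, it helps to have a small helper function. Here is a minimal sketch (the name rmse and its arguments are placeholders, not something fixed by the lab):
# Root mean squared error; truth and pred are numeric vectors of equal length
rmse <- function(truth, pred){
  sqrt(mean((truth - pred)^2))
}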
Hint: You need to put this function in a chunk and press play as part of this.
Great!!! Are we done?? Can we report this as our predictive accuracy??
It turns out that we shouldn’t. There is one issue with how we have created our validation data.
Question 10
Using a new random seed (363663), repeat the process of splitting the data into validation and training data, training the model, and finding the validation RMSE. State the value of the validation RMSE you get.
Hmm…this isn’t the same number we got in Question 9. Is this a fluke?
Question 11
Write a for loop that uses 10 different seeds to (1) split the penguins data into validation and training, (2) train the model on the new training data, and (3) print out the validation RMSE. In other words, write a for loop that does Question 10 for 10 different seeds. Show your code and run your code.
It turns out that with a data set this small, splitting the data into two data sets creates what is called high-variance estimation. This means that our estimate of the validation RMSE is highly dependent on how the data are split into the new training and validation data sets, which means our validation RMSE estimate is highly dependent on the random seed that we choose. This is not ideal; we don’t want our results to be left up to random chance. Because of this, we are going to try a different technique to create validation data.
Option 2: LOOCV
Splitting the data into two different data sets is not the only way to create validation data. Another option is called leave-one-out cross validation, or LOOCV. In this technique, we don’t actually create just one validation set and one training set. Instead, the method works on the idea of allowing each row in the original data set to be used as validation data. We train the model using all other rows of data, and then we make a prediction for the hold-out row (the validation row). We then repeat the process for every row in the data set, creating a vector \(\hat{Y}\) with predictions for every row in the original data set. Because the rows we treated as validation data were all in the original data set, we know what the value of \(Y\) should be for each row so we can compute the validation RMSE!
This all means that the first step in LOOCV is to create a data set that will hold our predicted values of body mass.
# Store the number of rows in the penguins data set
n <- nrow(penguins)
# Create the data set
YHat <- data.frame( "body_mass_g" = rep(NA,n) )
Once you have run the code above, look in your Environment tab (upper right corner of RStudio).
Question 12
How many rows and how many columns does YHat have?
As we go through the LOOCV process, we will fill in this data set with the predicted values of body mass for each penguin when that penguin is treated as validation data.
To get a feel for this process, let’s start with just running the process on the first row in the penguins data set.
First Iteration
The first step in LOOCV is to remove the first row from the data set and treat it as validation data.
We then treat the rest of the data as training data.
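For example (the object names below are placeholders, not ones fixed by the lab):
# Row 1 becomes the validation data
LOOCV_validation <- penguins[1, ]
# Every row except row 1 becomes the new training data
LOOCV_train <- penguins[-1, ]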
Remember that the - sign means “except” or “excluding”, so this code means that all the rows except row 1 are assigned to the new training set.
Question 13
Using the new training data, train a linear regression model for \(Y\) = body mass using all other columns as features. Call this model LOOCV_m1. Using your trained model, predict the body mass \(\hat{y}_1\) of the penguin in the validation data. State the predicted body mass.
This is the predicted body mass for the first penguin, so we want to store this predicted value in the first row of YHat.
Question 14
Run and show the code you would need to store the value of \(\hat{y}_1\) in the first row of
YHat
.
I would strongly recommend that at this point you open the YHat data set and make sure that the first row is filled in with the number you expect, and the rest of the rows are NAs.
Now that we have our prediction, we recall that the penguin in our validation set was actually the first row in the penguins data set, so we know the true body mass for this penguin.
Question 15
What is the residual \(y_1 - \hat{y}_1\) for this penguin?
For Loop
At this point, we have completed just one iteration of LOOCV. In other words, we have run the process for just the first row in the data set. Now, we are ready to perform the process on all \(n = 333\) rows in the original penguins data set.
Question 16
Write a for loop to perform LOOCV and fill in YHat. Show the for loop and print out the first 5 rows of YHat.
At this point it is a good idea to look over YHat to make sure there is no missing data, meaning that you have filled in every row. I usually use the summary command for this:
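# Any NAs reported for body_mass_g mean a row was missed in the loop
summary(YHat)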
Now we have a data set containing a predicted value of body mass for every penguin in the penguins data set. We also know the true body mass for every penguin in the penguins data set. This means that we can compute the validation RMSE.
Question 17
Compute the validation RMSE for LOOCV. State the value you get.
Note: I know that YHat only has one column, but for the RMSE code to run properly, you need to use YHat[, "body_mass_g"] or YHat[,1] to pull the predicted values. If you do not do this, your answer will not be correct!!
Next Steps
LOOCV is a very powerful tool for allowing us to assess predictive accuracy. However, it has a drawback. If you use LOOCV on large data sets, the process is computationally expensive, meaning it is slow and takes a lot of computer memory.
So, it sounds like LOOCV is good for small to medium data sets, and creating validation data by splitting the data is good for very large data sets (like hundreds of thousands of rows). Isn’t there something in the middle we can use?? Yep. We will learn k-fold CV next class.
References
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi:10.5281/zenodo.3960218.
This work, created by Nicole Dalzell, is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 February 6.