STA 363 Lab 3
Complete all Questions and submit your final html or PDF under Assignments in Canvas.
The Goal
We have started talking about prediction tasks, and one of the things we have discussed is how important it is to have test data to use to assess the predictive accuracy of our model. However, we are often in situations where we are given only one data set, rather than both a test data set and a training data set. What do we do then??
In our lab today, we will explore two different techniques for how we might approach this problem.
The Data
We are going to return to our penguin data set from Lab 1. It’s easy to load, and it has some nice linear relationships we can use today.
To load the data, put the following two lines of code inside a chunk and press play:
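# Load the palmerpenguins package (install it first if needed)
library(palmerpenguins)
# Attach the penguins data set
data(penguins)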
As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data. We will begin by removing the rows with missing data, which results in a data set with \(n=333\) penguins.
Yes, I know, we know a lot of techniques for handling missing data now!! However, that is not the focus for today so we are going to proceed with CCA.
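One way to do this complete case analysis in R is with na.omit(), which keeps only the rows with no missing values:
# Keep only the complete cases
penguins <- na.omit(penguins)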
Our goal for today is to predict \(Y=\) the body mass of a penguin in grams using all other columns in the data set as features. This means we have a predictive regression task.
As we discussed in class, when we have a predictive task we need two data sets: a training data set and a testing data set. However, today we only have one data set. Because of this, we are going to explore two different ways to create validation data so we can assess the predictive accuracy of our model.
Option 1: Splitting the Data
Since we only have one data set, but we need two, one option we may consider is breaking our one data set into two. To do this, we are going to take 80% of the rows at random from the penguins data set and treat those rows as training data. The remaining 20% of the rows will be treated as validation data.
Question 1
Why is it important to make the training set bigger than the validation set?
Question 2
Using an 80/20 split, how many rows will be treated as training data?
Note: If you have a value that includes decimals, we use the floor function to get a whole number. This means that we round down to the nearest whole number.
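For example, floor() rounds down to the nearest whole number:
floor(4.7)  # returns 4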
Okay, so we know how many rows that we need to treat as training data. How do we randomly sample that many rows using R?
Using Random Seeds
To randomly sample \(n_{train}\) rows from our original \(n\) rows, where \(n_{train} = \lfloor(n \times .8)\rfloor\), we use the following:
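# Fix the random seed so the same rows are chosen every time
set.seed(100)
# Randomly sample 80% of the row numbers 1 through 333, without replacement
trainRows <- sample(1:333, 333*.8, replace = FALSE)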
The code chunk above has two lines. The first line of code, set.seed(100), is one we will use quite a bit in this course. We are going to use random sampling to choose the rows for our training data. This means that we are going to ask the computer to randomly sample 266 of our 333 rows.

However, suppose you close your Markdown file and come back to it later. We want the computer to choose the SAME 266 rows when you run your code again. Otherwise, you would get completely different results each time you drew a random sample! The set.seed() function is what ensures that each time you run your chunk, you will get the same random sample. Let’s try that.
Question 3
Create a code chunk. Use the sample command to randomly sample 2 numbers between 1 and 10 (without replacement). Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now?
You should notice that every time you hit play on this chunk, you get a different sample. This is what would happen if you closed your Markdown and re-opened it, or if you gave your code to someone else to run. This is not something we want to have happen, so we set a random seed to fix this problem.
Question 4
Now, add the line set.seed(435) to the beginning of your code chunk from the previous question (meaning this line needs to come before the sample command). You will note that I used 435, but you can use literally any positive integer you want as your random seed. Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now?
Now we notice that no matter how often we run the chunk, we get the same values. Yes!! This means that we can close our R and come back to it later, and our results will not change. This also means that we can send our code to someone else, and they will get the same random sample that we did. This means that setting a seed can help make your code reproducible.
Setting a random seed (which is what we will call using the set.seed() command) will prove very useful for any kind of random sampling we do in this course.
Sampling in R
Now, let’s go back to the code chunk we started with:
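set.seed(100)
trainRows <- sample(1:333, 333*.8, replace = FALSE)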
We have discussed the first line, but what does the second line do? Well, sample(1:333, 333*.8, replace = FALSE) will draw 266 random numbers between 1 and 333 (1:333). The form of the code is

sample( What options am I sampling from?,
        How many values do I want to sample?,
        Do I want to sample with replacement? )
Question 5
Show the code you would need to sample 5 random numbers between 5 and 678 without replacement.
Once we understand what a piece of code is doing, the next step is to understand the type of output the function produces. Once we have drawn our 266 random numbers, we store that output in a vector called trainRows.
Question 6
Set a random seed of 245. Show the code you would need to sample 5 random numbers between 5 and 200 without replacement, and store the results as trainPractice. Which 5 random numbers did you select?
Okay, so right now trainRows is a vector that contains 266 numbers between 1 and 333. We don’t want a list of numbers; we want a training data set. This means that the next step is to go into the penguins data set and grab the rows specified in trainRows.
Question 7
Show and run the code you would need to create a data set called newTrain which contains the rows in the penguins data set specified by trainRows.
At this point, we have created our new training data. This means that 266 of the original 333 rows have been used. What do we do with the 67 rows that were not chosen for training data? The rows that were not selected for our new training data set will form our validation data set.
Question 8
Show and run the code you would need to create a data set called validation which contains all the rows in the penguins data set which are NOT specified by trainRows.
Before moving on, check to make sure newTrain has 266 rows and validation has 67 rows.
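One quick way to check is with nrow():
nrow(newTrain)     # should be 266
nrow(validation)   # should be 67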
Now that we have our validation data and training data, we have created a few objects we no longer need. To clean up our workspace in R, we can remove these objects if we wish. To do so, we use the rm function; for example, to remove the vector of sampled row numbers:
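# Remove an object we no longer need
rm(trainRows)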
Using the New Data
Now that we have successfully split our data into a validation set and a new training set, let’s use them!
Question 9
Using the new training data, train a linear regression model for \(Y\) = body mass using all other columns as features. Use this trained model to make predictions for body mass for all the penguins in the validation data. State the validation RMSE of your model.
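To compute the validation RMSE, it helps to have a small helper function. Here is a minimal sketch (the name rmse and its arguments are placeholders, not something fixed by the lab):
# Root mean squared error; truth and pred are numeric vectors of equal length
rmse <- function(truth, pred){
  sqrt(mean((truth - pred)^2))
}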
Hint: You need to put this function in a chunk and press play as part of this.
Great!!! Are we done?? Can we report this as our predictive accuracy??
It turns out that we shouldn’t. There is one issue with how we have created our validation data.
Question 10
Using a new random seed (363663), repeat the process of splitting the data into validation and training data, training the model, and finding the validation RMSE. State the value of the validation RMSE you get.
Hmm…this isn’t the same number we got in Question 9. Is this a fluke?
Question 11
Write a for loop that uses 10 different seeds to (1) split the penguins data into validation and training, (2) train the model on the new training data, and (3) print out the validation RMSE. In other words, write a for loop that does Question 10 for 10 different seeds. Show your code and run your code.
It turns out that with a data set this small, splitting the data into two data sets creates what is called high-variance estimation. This means that our estimate of the validation RMSE is highly dependent on how the data are split into the new training and validation data sets, which means our validation RMSE estimate is highly dependent on the random seed that we choose. This is not ideal; we don’t want our results to be left up to random chance. Because of this, we are going to try a different technique to create validation data.
Option 2: LOOCV
Splitting the data into two different data sets is not the only way to create validation data. Another option is called leave-one-out cross validation, or LOOCV. In this technique, we don’t actually create just one validation set and one training set. Instead, the method works on the idea of allowing each row in the original data set to be used as validation data. We train the model using all other rows of data, and then we make a prediction for the hold-out row (the validation row). We then repeat the process for every row in the data set, creating a vector \(\hat{Y}\) with predictions for every row in the original data set. Because the rows we treated as validation data were all in the original data set, we know what the value of \(Y\) should be for each row so we can compute the validation RMSE!
This all means that the first step in LOOCV is to create a data set that will hold our predicted values of body mass.
# Store the number of rows in the penguins data set
n <- nrow(penguins)
# Create the data set
YHat <- data.frame( "body_mass_g" = rep(NA,n) )
Once you have run the code above, look in your Environment tab (upper right corner of RStudio).
Question 12
How many rows and how many columns does YHat have?
As we go through the LOOCV process, we will fill in this data set with the predicted values of body mass for each penguin when that penguin is treated as validation data.
To get a feel for this process, let’s start with just running the process on the first row in the penguins data set.
First Iteration
The first step in LOOCV is to remove the first row from the data set and treat it as validation data.
We then treat the rest of the data as training data.
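For example (the object names below are placeholders, not ones fixed by the lab):
# Row 1 becomes the validation data
LOOCV_validation <- penguins[1, ]
# Every row except row 1 becomes the new training data
LOOCV_train <- penguins[-1, ]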
Remember that the - sign means “except” or “excluding”, so this code means that all the rows except row 1 are assigned to the new training set.
Question 13
Using the new training data, train a linear regression model for \(Y\) = body mass using all other columns as features. Call this model LOOCV_m1. Using your trained model, predict the body mass \(\hat{y}_1\) of the penguin in the validation data. State the predicted body mass.
This is the predicted body mass for the first penguin, so we want to store this predicted value in the first row of YHat.
Question 14
Run and show the code you would need to store the value of \(\hat{y}_1\) in the first row of
YHat
.
I would strongly recommend that at this point you open the YHat data set and make sure that the first row is filled in with the number you expect, and the rest of the rows are NAs.
Now that we have our prediction, we recall that the penguin in our validation set was actually the first row in the penguins data set, so we know the true body mass for this penguin.
Question 15
What is the residual \(y_1 - \hat{y}_1\) for this penguin?
For Loop
At this point, we have completed just one iteration of LOOCV. In other words, we have run the process for just the first row in the data set. Now, we are ready to perform the process on all \(n = 333\) rows in the original penguins data set.
Question 16
Write a for loop to perform LOOCV and fill in YHat. Show the for loop and print out the first 5 rows of YHat.
At this point it is a good idea to look over YHat to make sure there is no missing data, meaning that you have filled in every row. I usually use the summary command for this:
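# Any NAs reported for body_mass_g mean a row was missed in the loop
summary(YHat)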
Now we have a data set containing a predicted value of body mass for every penguin in the penguins data set. We also know the true body mass for every penguin in the penguins data set. This means that we can compute the validation RMSE.
Question 17
Compute the validation RMSE for LOOCV. State the value you get.
Note: I know that YHat only has one column, but for the RMSE code to run properly, you need to use YHat[, "body_mass_g"] or YHat[,1] to pull the predicted values. If you do not do this, your answer will not be correct!!
Next Steps
LOOCV is a very powerful tool for allowing us to assess predictive accuracy. However, it has a drawback. If you use LOOCV on large data sets, the process is computationally expensive, meaning it is slow and takes a lot of computer memory.
So, it sounds like LOOCV is good for small to medium data sets, and creating validation data by splitting the data is good for very large data sets (like hundreds of thousands of rows). Isn’t there something in the middle we can use?? Yep. We will learn k-fold CV next class.
References
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi:10.5281/zenodo.3960218.
This work, created by Nicole Dalzell, is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 February 6.