STA 363 Lab 4
Complete all Questions and submit your final PDF or html under Assignments in Canvas.
The Goal
We have been learning a lot of code in our quest to explore cross validation techniques. Today, we are going to dig more deeply into the code behind 10-fold CV. We are also going to practice KNN, and combining KNN with 10-fold CV requires a new coding structure called nested for-loops.
The Data
We are going to stick with our penguins data set from last lab. To load the data, put the following lines of code inside a chunk and press play:
library(palmerpenguins)
data("penguins")
penguins <- na.omit(penguins)
# Remove the feature year: we don't want to use it today
penguins <- data.frame(penguins)[,-8]
Our goal for today is to predict \(Y\) = the body mass of the penguin. As we are provided with only one data set, we will need to determine a way to assess the predictive accuracy of any technique we choose. We will do this today using 10-fold CV. 10-fold CV involves dividing our original data set into 10 different subsets called folds. Each fold will contain roughly the same number of rows, and each row from the original data set will be assigned to exactly one fold.
In class, we discussed the folds in terms of tables at a restaurant. We have 10 folds (so 10 tables) and now we need to think about how many penguins need to be assigned to each fold (table).
Question 1
How many rows should be in each fold for our penguins
data? Will all the folds have exactly the same number of rows? Explain
your reasoning.
Once we have determined how many rows (penguins) need to go in to each fold (table), we need to complete the process of actually assigning each row to a fold (penguin to a table). This involves sampling without replacement. Before we try this on the real data set, let’s try a smaller example.
Small Data Example
Let’s create a smaller data set to work with to illustrate the code.
Question 2
Create a data set called penguinsSmall
that contains the
first 20 rows in the penguin data set. If we use 5-fold
CV with this smaller data set, how many rows belong in each
fold?
Assigning Fold Numbers
Assigning rows to folds is like assigning people to tables as they enter a room. We have 20 people who will enter a room, and 5 tables for them to sit at. We want the same number of people at each table. To achieve this, we fill a box with 20 tickets. Each ticket contains a number: 1, 2, 3, 4, or 5. Each person chooses a ticket, and the number on the ticket tells the person which table to sit at.
Question 3
For our 5-fold CV with penguinsSmall
, how many tickets
in this box need to contain the number 4?
Now, in R we aren’t actually creating tickets, but that is the idea. To do this in R, we are going to create a vector rather than a box to hold our tickets. To do this, use the following:
The command rep
in R means “repeat”. So, this command
means “Repeat the numbers 1 to 5 four times”. Take a look at what is in
the vector.
Question 4
Suppose we have a data set with 500 rows, and we want to use 5-fold
CV. What code would we need to create the equivalent of the
tickets
vector for this new data set? Call your vector
tickets_Q4
.
Now that we have we have created our tickets, the next step is to have each person (each row) draw a ticket (a fold number) as they enter the restaurant. To do this, we will use sampling without replacement.
# Set the random seed
set.seed(363663)
# Draw tickets from the box
foldsSmall <- sample(tickets, 20, replace = FALSE)
Recall that with the sample
command, the command
structure is sample(What to Sample From, How Many to Sample, Do we
sample with or without replacement). We are sampling from the box of
tickets, 20 people are drawing out a ticket, and they do not put their
ticket back once they draw it.
The vector foldsSmall
that we have just created
indicates which row in our 20 row data set belongs to each of the
5-folds. The first number in foldsSmall
tells us the fold
that the first row of the data is assigned to, and so on.
Question 5
Which fold is the 5th row in the data set assigned to? What about the 20th row?
At this point, we have determined which rows in the
penguinsSmall
data set belong to each fold, but we have not
actually created the separate data sets defined by each fold. In other
words, each penguin is holding a ticket so they know which table number
they need to go to, but we have not actually moved the penguins to their
tables. Let’s do that now.
Moving rows into folds (Creating the separate data sets)
We have already created the vector foldsSmall
which
indicates the folds each row is assigned to. Now, let’s actually figure
out which rows are in each fold.
We are going to start with fold 1. Which rows in
penguinsSmall
are in fold 1? To determine this, we use the
following code:
In R, ==
looks to see if two things match. So,
2 ==3
will return FALSE
and 2==2
will return TRUE
.
Question 6
Which rows in the penguinsSmall
data set are in fold
1?
Question 7
Which rows in the penguinsSmall
data set are in fold
2?
Now that we know which rows are in a fold, we want to actually create
a data set that contains those rows. To create fold 1, this involves
taking all the rows from penguinsSmall
that are in fold 1
and storing them as a data set called fold1
.
Question 8
Create fold 2 (meaning show the fold 2 data set) for
penguinsSmall
and print out the fold. Check to make sure
the row numbers match what you have in Question 7!
Okay, so now we can actually create the folds we need! Let’s go back our larger data set.
Back to the Penguin Data
Okay, now that we have seen an example of how to assign the rows in a data set to folds, let’s try it with all \(n=333\) rows in the penguins data set using 10-fold CV.
Question 9
Use R to assign all 333 rows in the penguins
data set to
10 folds, and store them in a vector called folds
. Use the
same seed as we did in the previous section (363663). Which fold is the
1st row in the penguins data set assigned to? What about the 2nd
row?
Question 10
Create fold 10 (meaning show the fold 10 data set) for
penguins
.
10-fold CV: Running the Loop
Now that we know how to create the folds, let’s move on to the process of running the loop. We need to:
- Set f = 1
- Set fold f as validation data and the rest of the folds as training data
- Build a linear regression model to predict body mass using all other columns as features.
- Store the predictions for the rows in the validation data.
- And repeat!
This should sound like a for loop, because it is one! We also need one step before we run the loop, which is creating our storage space:
One of the tricks to making a for loop is to write the inside first. In other words, write the code that would give the results for f = 1. Once we have done that, it is easier to figure out what parts of the code need to change with f and create the loop. Let’s try it.
Question 11
Code and run Step 1 - 4 above for f=1. Show your code. Which rows in
your predictions
are currently filled in? Does it make
sense that these rows are filled in? Explain.
Hint: which( is.na(predictions)==FALSE)
might help you
find the rows that are filled in.
Building the inside first makes it easier to catch any errors. This in essence shows us what each iteration of the loop will do. Once it does what we want, we can write the loop!
Question 12
Create a for loop to run 10-fold CV for the penguins
data. State the validation RMSE.
Reminder: You need to put this function in a chunk and press play before you can compute the RMSE.
KNN
Now that we have seen how the process words, let’s try another technique to predict body mass: KNN.
Look at Row 1 in the penguins data and Row 53 in the penguins data.
1 | 53 | |
---|---|---|
species | Adelie | Adelie |
island | Torgersen | Biscoe |
bill_length_mm | 39.1 | 36.5 |
bill_depth_mm | 18.7 | 16.6 |
flipper_length_mm | 181 | 181 |
body_mass_g | 3750 | 2850 |
sex | male | female |
We want to use the Gower’s Distance to find the distance between these two rows. Since we have two variables that are categorical with more than two levels, our code will actually “see” a slightly different version of the two rows.
1 | 53 | |
---|---|---|
speciesChinstrap | 0.0 | 0.0 |
speciesGentoo | 0.0 | 0.0 |
islandDream | 0.0 | 0.0 |
islandTorgersen | 1.0 | 0.0 |
bill_length_mm | 39.1 | 36.5 |
bill_depth_mm | 18.7 | 16.6 |
flipper_length_mm | 181.0 | 181.0 |
sexmale | 1.0 | 0.0 |
Question 13
Using the data shown above, find the Gower Distance between Row 1 and Row 53.
Note: It is good practice to compute this by hand.
We are going to use 10-fold CV to estimate the test RMSE of 5-NN (5-nearest neighbors). Before running KNN, we need to load the StatMatch library and teach R the function to compute Gower’s Distance.
knnGower <- function(trainX, testX, trainY, K){
# Convert the data
trainX <- model.matrix(trainY~ ., data = trainX)[,-1]
holder <- 1:nrow(testX)
testX <- model.matrix(holder~ ., data = testX)[,-1]
# Find the Gower Distance
gowerHold <- StatMatch::gower.dist( testX, trainX)
# For each row, find the k smallest
neighbors <- apply(gowerHold, 1, function(x) which(x %in% sort(x)[1:K]))
if(class(neighbors)[1]=="integer"){
preds <- trainY[neighbors]
}
# Take the mean to get the prediction
if(class(neighbors)[1]=="matrix"){
preds <- apply(neighbors, 2,function(x) mean(trainY[x]))
}
if(class(neighbors)[1]=="list"){
preds <- lapply(neighbors, function(x) mean(trainY[x]))
preds <- unlist(preds)
}
# Return the predictions
unlist(preds)
}
Recall that to use this function, the structure is:
We want to use 10-fold CV to assess the predictive accuracy of this approach. Our steps for the 10-fold CV with KNN are:
- Set f = 1
- Set fold f as validation data and the rest of the folds as training data
- Use 5-NN to make predictions for the rows in the validation data.
- Store the predictions
- And repeat!
Question 14
Using 5-NN and \(k = 10\) folds (yes, I know the notation is annoying, but we’re stuck with it!), write a for loop to use 5-NN to make predictions for body mass. Show your code.
Question 15
State the validation RMSE for 5-NN. Is it larger or smaller than what you got using linear regression?
Considering K
In the work we have done so far, we have used 5-NN. However, is that the best choice for the number of neighbors? Could we end up with a smaller validation RMSE if we choose a different number of neighbors \(K\)?
To determine this, we actually need to try out different options of \(K\). This means not only do we need a for loop that loops over the folds, we also need one that allows us to explore different choices of \(K\). The code structure we need for this is called a Nested for-loop.
Nested For Loops
A Nested for-loop looks like this:
Let’s suppose for each number 1 - 5 I want to add the numbers 1 - 10. To do this using a nested for-loop we use the following:
Look over this code and make sure you know exactly what it is doing. If you have any questions, make sure you ask before you move to the next piece of the lab!!
Nested For Loop with KNN
We are now going to use a nested for-loop to determine what value of \(K\) we should choose for KNN. This basically means that we want to take our code from Question 14, which uses a for loop to use 10-fold with 5-NN, and now we want to use a 2nd for loop to allow the number of neighbors to change! Generally, I start with the inner loop and then think about the outer loop as wrapping around that inner loop and repeating it a lot of times.
Typically, the trickiest part of nested for-loops is deciding how to store the results that you need. Sometimes, there are things that you need to store inside the inner loop that do not actually need to be output at the end of the outer loop.
For instance, in our case we need to perform 10-fold CV in the inner loop, which means we need to create a vector to store all of our predicted values of \(Y\). This is not something we need at the end of the outer loop - we only need the validation RMSE for each choice of \(K\) in the outer loop.
This means we need to create two data frames:
# Create a storage space to hold the value of K and the
# validation RMSE for that choice of K
DATAFRAME 1 HERE
# Outer Loop
for( k in ){
# Create a storage space to hold the predictions from
# 10-fold CV
DATAFRAME 2 HERE
# Inner Loop
for( f in ){
}
}
Question 16
Fill in the DATAFRAME HERE parts of the code above, as well as the
empty for()
parts.
Those are the hardest parts, but now, it’s time to fill in the rest.
Question 17
Using Question 16 as a starting point, write a code that for \(K = 3, \dots, 29\) estimates the test RMSE using 10 fold-CV and stores the result. Print out the data frame you get at the end of the outer loop and state which value of \(K\) you would recommend.
Turning in your assignment
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.
This
work was created by Nicole Dalzell is licensed under a
Creative
Commons Attribution-NonCommercial 4.0 International License. Last
updated 2024 September 22.
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218. .