STA 363 Lab 4
Complete all questions and submit your final PDF or HTML under Assignments in Canvas.
The Goal
We have been learning a lot of code in our quest to explore cross validation techniques. Today, we are going to dig more deeply into the code behind 10-fold CV.
The Data
We are going to stick with our penguins data set from last lab. To load the data, put the following lines of code inside a chunk and press play:
library(palmerpenguins)
data("penguins")
penguins <- na.omit(penguins)
# Remove the feature year: we don't want to use it today
penguins <- data.frame(penguins)[,-8]
Our goal for today is to predict \(Y\) = the body mass of the penguin. As we are provided with only one data set, we will need to determine a way to assess the predictive accuracy of any technique we choose. We will do this today using 10-fold CV. 10-fold CV involves dividing our original data set into 10 different subsets called folds. Each fold will contain roughly the same number of rows, and each row from the original data set will be assigned to exactly one fold.
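As a rough illustration of how fold sizes work out when the number of rows is not a multiple of the number of folds, here is a minimal sketch. The values of `n` and `k` below are made up for illustration and are not the penguins data; the names `base_size`, `leftover`, and `fold_sizes` are hypothetical.

```r
# Hypothetical example: n and k are made-up values, not the lab's data
n <- 103   # number of rows
k <- 10    # number of folds

base_size <- n %/% k   # minimum number of rows in every fold
leftover  <- n %% k    # number of folds that receive one extra row

# The first `leftover` folds get base_size + 1 rows, the rest get base_size
fold_sizes <- rep(base_size, k) + as.integer(seq_len(k) <= leftover)
fold_sizes

# Every row is assigned to exactly one fold, so the sizes add up to n
sum(fold_sizes) == n
```

The same arithmetic applies to any data set: integer division gives the minimum fold size, and the remainder tells us how many folds need one extra row.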
In class, we discussed the folds in terms of tables at a restaurant. We have 10 folds (so 10 tables) and now we need to think about how many penguins need to be assigned to each fold (table).
Question 1
How many rows should be in each fold for our penguins data? Will all the folds have exactly the same number of rows? Explain your reasoning.
Once we have determined how many rows (penguins) need to go into each fold (table), we need to complete the process of actually assigning each row to a fold (penguin to a table). This involves sampling without replacement. Before we try this on the real data set, let’s try a smaller example.
Small Data Example
Let’s create a smaller data set to work with to illustrate the code.
Question 2
Create a data set called penguinsSmall that contains the first 20 rows in the penguins data set. If we use 5-fold CV with this smaller data set, how many rows belong in each fold?
Assigning Fold Numbers
Assigning rows to folds is like assigning people to tables as they enter a room. We have 20 people who will enter a room, and 5 tables for them to sit at. We want the same number of people at each table. To achieve this, we fill a box with 20 tickets. Each ticket contains a number: 1, 2, 3, 4, or 5. Each person chooses a ticket, and the number on the ticket tells the person which table to sit at.
Question 3
For our 5-fold CV with penguinsSmall, how many tickets in this box need to contain the number 4?
Now, in R we aren’t actually creating tickets, but that is the idea. Instead of a box, we are going to create a vector to hold our tickets. To do this, use the following:
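A minimal sketch of this command, assuming the vector is named `tickets` (the name that the later `sample()` call draws from) and that it repeats the numbers 1 to 5 four times:

```r
# Assumption: the vector is called `tickets`, the name used by the later sample() call
tickets <- rep(1:5, 4)   # repeat the numbers 1 to 5 four times

tickets                  # look at what is in the vector
length(tickets)          # one ticket per row of the 20-row data set
```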
The command rep in R means “repeat”. So, this command means “Repeat the numbers 1 to 5 four times”. Take a look at what is in the vector.
Question 4
Suppose we have a data set with 500 rows, and we want to use 5-fold CV. What code would we need to create the equivalent of the tickets vector for this new data set? Call your vector tickets_Q4.
Now that we have created our tickets, the next step is to have each person (each row) draw a ticket (a fold number) as they enter the restaurant. To do this, we will use sampling without replacement.
# Set the random seed
set.seed(363663)
# Draw tickets from the box
foldsSmall <- sample(tickets, 20, replace = FALSE)
Recall that with the sample command, the command structure is sample(What to Sample From, How Many to Sample, Do we sample with or without replacement). We are sampling from the box of tickets, 20 people are drawing out a ticket, and they do not put their ticket back once they draw it.
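One consequence worth checking: when all 20 tickets are drawn without replacement, the result is just a reshuffling of the box, so no fold number is gained or lost. A minimal sketch, assuming `tickets` holds the numbers 1 to 5 four times each:

```r
# Assumption: tickets holds the numbers 1 to 5, four times each
tickets <- rep(1:5, 4)

set.seed(363663)
foldsSmall <- sample(tickets, 20, replace = FALSE)

# Drawing every ticket without replacement only reorders the box,
# so sorting both vectors should give the same result
identical(sort(foldsSmall), sort(tickets))
```

If `replace = TRUE` were used instead, some tickets could be drawn more than once, and the folds would no longer be guaranteed to have equal sizes.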
The vector foldsSmall that we have just created indicates which of the 5 folds each row in our 20-row data set belongs to. The first number in foldsSmall tells us the fold that the first row of the data is assigned to, and so on.
Question 5
Which fold is the 5th row in the data set assigned to? What about the 20th row?
At this point, we have determined which rows in the penguinsSmall data set belong to each fold, but we have not actually created the separate data sets defined by each fold. In other words, each penguin is holding a ticket so they know which table number they need to go to, but we have not actually moved the penguins to their tables. Let’s do that now.
Moving rows into folds (Creating the separate data sets)
We have already created the vector foldsSmall, which indicates the fold each row is assigned to. Now, let’s actually figure out which rows are in each fold.
We are going to start with fold 1. Which rows in penguinsSmall are in fold 1? To determine this, we use the following code:
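A minimal sketch of this lookup, assuming it combines `which()` with the `==` comparison and that `tickets` and `foldsSmall` were created as in the previous section:

```r
# Recreate the fold assignments (assumption: tickets is rep(1:5, 4))
tickets <- rep(1:5, 4)
set.seed(363663)
foldsSmall <- sample(tickets, 20, replace = FALSE)

foldsSmall == 1          # TRUE or FALSE for each of the 20 rows
which(foldsSmall == 1)   # positions (row numbers) where the fold number is 1
```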
In R, == looks to see if two things match. So, 2 == 3 will return FALSE and 2 == 2 will return TRUE.
Question 6
Which rows in the penguinsSmall data set are in fold 1?
Question 7
Which rows in the penguinsSmall data set are in fold 2?
Now that we know which rows are in a fold, we want to actually create a data set that contains those rows. To create fold 1, this involves taking all the rows from penguinsSmall that are in fold 1 and storing them as a data set called fold1.
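A minimal sketch of this step, using a made-up 20-row data frame called `toyData` as a stand-in for penguinsSmall (the stand-in and its columns are hypothetical; the same indexing should work on any data frame with 20 rows):

```r
# Hypothetical stand-in for penguinsSmall: any data frame with 20 rows works
toyData <- data.frame(id = 1:20, value = seq(10, 200, by = 10))

# Recreate the fold assignments (assumption: tickets is rep(1:5, 4))
tickets <- rep(1:5, 4)
set.seed(363663)
foldsSmall <- sample(tickets, 20, replace = FALSE)

# Keep the rows whose fold number is 1;
# leaving the slot after the comma empty keeps all columns
fold1 <- toyData[which(foldsSmall == 1), ]
fold1
```

The row names printed on the left of `fold1` are the original row numbers, which makes it easy to compare against the output of `which()`.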
Question 8
Create fold 2 (meaning show the fold 2 data set) for penguinsSmall and print out the fold. Check to make sure the row numbers match what you have in Question 7!
Okay, so now we can actually create the folds we need! Let’s go back to our larger data set.
Back to the Penguin Data
Okay, now that we have seen an example of how to assign the rows in a data set to folds, let’s try it with all \(n=333\) rows in the penguins data set using 10-fold CV.
Question 9
Use R to assign all 333 rows in the penguins data set to 10 folds, and store them in a vector called folds. Use the same seed as we did in the previous section (363663). Which fold is the 1st row in the penguins data set assigned to? What about the 2nd row?
Question 10
Create fold 10 (meaning show the fold 10 data set) for penguins.
10-fold CV: Running the Loop
Now that we know how to create the folds, let’s move on to the process of running the loop. We need to:
- Set f = 1
- Set fold f as validation data and the rest of the folds as training data
- Build a linear regression model to predict body mass using all other columns as features.
- Store the predictions for the rows in the validation data.
- And repeat!
This should sound like a for loop, because it is one! We also need one step before we run the loop, which is creating our storage space:
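A minimal sketch of such a storage object, assuming a plain vector of `NA` values named `predictions` (the name used in the hint to Question 11). The size `n` and the filled-in values below are made up purely for illustration:

```r
# Hypothetical size; in the lab this would be the number of rows in the data set
n <- 12

# One empty slot (NA) per row, to be filled in as each fold is processed
predictions <- rep(NA, n)

# Suppose rows 2, 5, and 9 are the validation rows; store made-up predictions there
predictions[c(2, 5, 9)] <- c(3500, 4100, 3900)

# Rows that have been filled in so far
which(is.na(predictions) == FALSE)
```

Starting from all `NA` values makes it easy to see which rows have already received a prediction.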
One of the tricks to making a for loop is to write the inside first. In other words, write the code that would give the results for f = 1. Once we have done that, it is easier to figure out what parts of the code need to change with f and create the loop. Let’s try it.
Question 11
Code and run Steps 1 - 4 above for f = 1. Show your code. Which rows in your predictions are currently filled in? Does it make sense that these rows are filled in? Explain.
Hint: which(is.na(predictions) == FALSE) might help you find the rows that are filled in.
Building the inside first makes it easier to catch any errors. This in essence shows us what each iteration of the loop will do. Once it does what we want, we can write the loop!
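To show the overall shape of the finished loop without using the penguins data, here is a minimal sketch on R's built-in `mtcars` data. The choice of 4 folds, the seed, the model `mpg ~ wt + hp`, and the object names `infold`, `train`, `valid`, and `fit` are all assumptions made for this illustration.

```r
# Illustration on the built-in mtcars data (32 rows), NOT the penguins data
data(mtcars)
n <- nrow(mtcars)
k <- 4                               # 4 folds of 8 rows each

set.seed(1)                          # arbitrary seed for this illustration
folds <- sample(rep(1:k, n / k), n, replace = FALSE)

predictions <- rep(NA, n)            # storage space, one slot per row

for (f in 1:k) {
  infold <- which(folds == f)        # rows in fold f
  train  <- mtcars[-infold, ]        # all other folds are training data
  valid  <- mtcars[infold, ]         # fold f is the validation data
  fit    <- lm(mpg ~ wt + hp, data = train)
  predictions[infold] <- predict(fit, newdata = valid)
}

sum(is.na(predictions))              # 0 once every fold has been processed
```

Everything inside the braces is what was written first for a single fold; the only thing that changes from one iteration to the next is `f`.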
Question 12
Create a for loop to run 10-fold CV for the penguins data. State the validation RMSE.
Reminder: You need to put this function in a chunk and press play before you can compute the RMSE.
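For reference, the RMSE is the square root of the average squared difference between the true values and the predictions. A minimal sketch of a function that computes it is below; the name `compute_RMSE` and its argument order are assumptions, so use the version given in class if it differs.

```r
# Hypothetical helper: the name and argument order are assumptions
compute_RMSE <- function(truth, predictions) {
  sqrt(mean((truth - predictions)^2))
}

# Made-up example: both predictions are off by 2, so the RMSE is 2
compute_RMSE(c(10, 20), c(12, 18))
```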
KNN
Now that we have seen how the process works, let’s try another technique to predict body mass: KNN.
Look at Row 1 in the penguins data and Row 53 in the penguins data.
Variable | Row 1 | Row 53 |
---|---|---|
species | Adelie | Adelie |
island | Torgersen | Biscoe |
bill_length_mm | 39.1 | 36.5 |
bill_depth_mm | 18.7 | 16.6 |
flipper_length_mm | 181 | 181 |
body_mass_g | 3750 | 2850 |
sex | male | female |
We want to use Gower’s Distance to find the distance between these two rows. Since we have two variables that are categorical with more than two levels, our code will actually “see” a slightly different version of the two rows.
Variable | Row 1 | Row 53 |
---|---|---|
speciesChinstrap | 0.0 | 0.0 |
speciesGentoo | 0.0 | 0.0 |
islandDream | 0.0 | 0.0 |
islandTorgersen | 1.0 | 0.0 |
bill_length_mm | 39.1 | 36.5 |
bill_depth_mm | 18.7 | 16.6 |
flipper_length_mm | 181.0 | 181.0 |
sexmale | 1.0 | 0.0 |
Question 13
Using the data shown above, find the Gower Distance between Row 1 and Row 53.
Note: It is good practice to compute this by hand.
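As a reminder of the mechanics, here is a simplified sketch of the calculation when every column is numeric, as in the converted version of the rows. Each column contributes the absolute difference divided by that column's range, and the contributions are averaged. The function name `gower_numeric`, the two rows, and the ranges below are all hypothetical and are not taken from the penguins data.

```r
# Simplified sketch: every column is numeric
# (dummy variables are treated as numeric columns with range 1)
gower_numeric <- function(a, b, ranges) {
  mean(abs(a - b) / ranges)
}

# Hypothetical rows and ranges, not taken from the penguins data
rowA   <- c(dummy = 1, len = 10, depth = 3)
rowB   <- c(dummy = 0, len = 14, depth = 3)
ranges <- c(dummy = 1, len = 8,  depth = 2)

gower_numeric(rowA, rowB, ranges)   # (1/1 + 4/8 + 0/2) / 3 = 0.5
```

For the hand calculation, the range of each numeric feature has to come from the data itself, since it is the largest value minus the smallest value in that column.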
We are going to use 10-fold CV to estimate the test RMSE of 5-NN (5-nearest neighbors). Before running KNN, we need to load the StatMatch library and teach R the function to compute Gower’s Distance.
knnGower <- function(trainX, testX, trainY, K){
  # Convert the data
  trainX <- model.matrix(trainY~ ., data = trainX)[,-1]
  holder <- 1:nrow(testX)
  testX <- model.matrix(holder~ ., data = testX)[,-1]
  # Find the Gower Distance
  gowerHold <- StatMatch::gower.dist( testX, trainX)
  # For each row, find the k smallest
  neighbors <- apply(gowerHold, 1, function(x) which(x %in% sort(x)[1:K]))
  if(class(neighbors)[1]=="integer"){
    preds <- trainY[neighbors]
  }
  # Take the mean to get the prediction
  if(class(neighbors)[1]=="matrix"){
    preds <- apply(neighbors, 2,function(x) mean(trainY[x]))
  }
  if(class(neighbors)[1]=="list"){
    preds <- lapply(neighbors, function(x) mean(trainY[x]))
    preds <- unlist(preds)
  }
  # Return the predictions
  unlist(preds)
}
Recall that to use this function, the structure is:
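Reading the arguments off the function definition, a call has the form `knnGower(trainX, testX, trainY, K)`. The sketch below sets up each argument using R's built-in `mtcars` data rather than the penguins data; the chosen rows and columns are arbitrary. The call is guarded so that the chunk still runs if `knnGower` or the StatMatch library has not been loaded yet.

```r
# Illustration on the built-in mtcars data, NOT the penguins data
data(mtcars)
trainX <- mtcars[1:25, c("wt", "hp")]    # features for the training rows
testX  <- mtcars[26:32, c("wt", "hp")]   # features for the rows to predict
trainY <- mtcars$mpg[1:25]               # response for the training rows
K      <- 5                              # number of neighbors

# Only run the call if knnGower and StatMatch are both available
if (exists("knnGower") && requireNamespace("StatMatch", quietly = TRUE)) {
  preds <- knnGower(trainX, testX, trainY, K)
  length(preds)   # one prediction per row of testX
}
```

Note that `trainX` and `testX` hold only the features, while the response goes in separately as `trainY`.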
We want to use 10-fold CV to assess the predictive accuracy of this approach. Our steps for the 10-fold CV with KNN are:
- Set f = 1
- Set fold f as validation data and the rest of the folds as training data
- Use 5-NN to make predictions for the rows in the validation data.
- Store the predictions
- And repeat!
Question 14
Using 5-NN and \(k = 10\) folds (yes, I know the notation is annoying, but we’re stuck with it!), write a for loop to use 5-NN to make predictions for body mass. Show your code.
Question 15
State the validation RMSE for 5-NN. Is it larger or smaller than what you got using linear regression?
Considering K
In the work we have done so far, we have used 5-NN. However, is that the best choice for the number of neighbors? Could we end up with a smaller validation RMSE if we choose a different number of neighbors \(K\)?
To determine this, we actually need to try out different options of \(K\). This means not only do we need a for loop that loops over the folds, we also need one that allows us to explore different choices of \(K\). The code structure we need for this is called a nested for loop. You will work with this for your Project 1.
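To give a sense of the structure, here is a minimal sketch on R's built-in `mtcars` data. It is not the code for Project 1: the helper `knn1d` is a hypothetical one-feature stand-in for KNN that uses absolute distance rather than Gower’s Distance, and the number of folds, the seed, and the candidate values in `Kvalues` are arbitrary.

```r
# Illustration on the built-in mtcars data with one numeric feature, NOT the penguins data
data(mtcars)
n <- nrow(mtcars)
k <- 4                                  # number of folds
Kvalues <- c(1, 3, 5)                   # hypothetical choices for the number of neighbors

set.seed(1)                             # arbitrary seed for this illustration
folds <- sample(rep(1:k, n / k), n, replace = FALSE)

# Hypothetical helper: average the response of the K closest training rows
knn1d <- function(trainX, testX, trainY, K) {
  sapply(testX, function(x0) mean(trainY[order(abs(trainX - x0))[1:K]]))
}

rmse <- rep(NA, length(Kvalues))        # one validation RMSE per choice of K

for (j in seq_along(Kvalues)) {         # outer loop: choices of K
  predictions <- rep(NA, n)             # fresh storage for each K
  for (f in 1:k) {                      # inner loop: folds
    infold <- which(folds == f)
    predictions[infold] <- knn1d(mtcars$wt[-infold], mtcars$wt[infold],
                                 mtcars$mpg[-infold], Kvalues[j])
  }
  rmse[j] <- sqrt(mean((mtcars$mpg - predictions)^2))
}

data.frame(K = Kvalues, RMSE = rmse)
```

The inner loop is the same 10-fold CV loop as before; the outer loop simply repeats it once for each candidate number of neighbors and records the resulting RMSE.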
Turning in your assignment
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or HTML document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 September 22.
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.