STA 363 Lab 4
Complete all Questions and submit your final PDF or html under Assignments in Canvas.
The Goal
We have been learning a lot of code in our quest to explore cross validation techniques. Today, we are going to dig more deeply into the code behind 10-fold cross validation to make sure we understand the steps. This will help when you work on your first project!
The Data
We are going to go back to the data from our very first lab, which involves penguins. To load the data, put the following two lines of code inside a chunk and press play:
library(palmerpenguins)
data("penguins")
As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data, so we will begin by removing the rows with missing data.
penguins <- na.omit(penguins)
Our goal for today is to predict Y = the sex of the penguin based on the flipper length and bill length of that penguin using 5-nearest neighbors. As we are provided with only one data set, we will need to determine a way to assess the predictive accuracy of any technique we choose. We will do this today using 10-fold CV.
10-fold CV: Creating Folds
10-fold CV involves dividing our original data set into 10 different data sets, called folds. Each fold will contain roughly the same number of rows.
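When the number of rows is not a multiple of the number of folds, the folds cannot all be the same size. The sketch below uses made-up numbers (23 rows and 5 folds, which are assumptions for illustration and not the penguins data) to show one way the sizes might work out:

```r
# Hypothetical sizes for illustration only (not the penguins data)
nRows <- 23
nFolds <- 5

# Recycle the fold labels 1, 2, ..., nFolds until there is one label per row
foldLabels <- rep(1:nFolds, length.out = nRows)

# Count how many rows carry each fold label
foldSizes <- table(foldLabels)
foldSizes
```

In this toy case the fold sizes differ by one row, because 23 rows cannot be split evenly into 5 groups.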
Question 1
How many rows should be in each fold for our penguins data? Will all the folds have exactly the same number of rows? Explain your reasoning.
Once we have determined how many rows need to go into each fold, we need to assign each row to a fold. This involves sampling without replacement. Before we try this on the real data set, let’s try a smaller example.
Example: Assigning Folds
Let’s create a smaller data set to work with to illustrate the code.
Question 2
Create a data set called penguinsSmall that contains the first 20 rows in the penguins data set. If we use 5-fold CV with this smaller data set, how many rows belong in each fold?
Assigning rows to folds is like assigning people to tables as they enter a room. We have 20 people who will enter a room, and 5 tables for them to sit at. We want the same number of people at each table. To achieve this, we fill a box with 20 slips of paper. Each paper contains a number: 1, 2, 3, 4, or 5. Each person chooses a slip of paper, and the number on the paper tells the person which table to sit at.
Question 3
For our 5-fold CV with penguinsSmall, how many slips of paper in this box need to contain the number 4?
Now, in R we aren’t actually creating slips of paper, but that is the idea. To do this in R, we are going to create a vector rather than a box to hold our table (fold) numbers. To do this, use the following:
papers <- rep(1:5, 4)
The command rep in R means “repeat”. So, this command means “Repeat the numbers 1 to 5 four times.”
In this code, we create an object called papers. This vector holds the numbers 1 through 5, each repeated 4 times.
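To see how rep orders its output, here is a small sketch with the numbers 1 to 3 (chosen only for illustration). The each argument is part of base R, but it is not used elsewhere in this lab:

```r
# Repeat the whole sequence 1, 2, 3 two times
wholeSeq <- rep(1:3, 2)
wholeSeq               # 1 2 3 1 2 3

# Repeat each number two times before moving on to the next number
eachNum <- rep(1:3, each = 2)
eachNum                # 1 1 2 2 3 3

# Both versions contain every number the same number of times
table(wholeSeq)
table(eachNum)
```

Either ordering would fill the box with the same slips of paper. The order does not matter, because the slips are shuffled before anyone draws one.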
Question 4
Suppose we have a data set with 500 rows, and we want to use 5-fold CV. What code would we need to create the equivalent of the papers vector for this new data set?
Now, we have created our box full of papers. The next step is to have each person (each row) draw a slip of paper (a fold number) as they enter the room. To do this, we will use sampling without replacement.
set.seed(363663)
foldsSmall <- sample(papers, 20, replace = FALSE)
Recall that with the sample command, the command structure is sample(What to Sample From, How Many to Sample, Do we sample with or without replacement).
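The sketch below illustrates why the choice of replacement matters. It uses a hypothetical box of 6 slips for 3 tables and an arbitrary seed, neither of which is part of the lab:

```r
# A small hypothetical box: 6 slips of paper for 3 tables
box <- rep(1:3, 2)

set.seed(1)  # arbitrary seed so the draws are reproducible

# Without replacement, every slip is drawn exactly once,
# so each table still gets the same number of people
drawn <- sample(box, 6, replace = FALSE)
table(box)
table(drawn)

# With replacement, a slip can be drawn more than once,
# so the tables may end up with different numbers of people
drawnWith <- sample(box, 6, replace = TRUE)
table(drawnWith)
```

Sampling without replacement only shuffles the slips, which is what keeps the fold sizes balanced.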
The vector foldsSmall that we have just created indicates which of the 5 folds each row in our 20-row data set belongs to. The first number in foldsSmall tells us the fold that the first row of the data is assigned to, and so on.
Question 5
Which fold is the 5th row in the data set assigned to? What about the 20th row?
Back to the Penguin Data
Okay, now that we have seen an example of how to assign the rows in a data set to folds, let’s try it with the \(n=333\) rows in the penguins data set using 10-fold CV.
Question 6
Use R to assign the 333 rows to 10 folds. Use the same seed as we did in the previous section (363663). Which fold is the 1st row in the penguins data set assigned to? What about the second row?
Hint: Make sure you use new names in place of papers and foldsSmall so you don’t overwrite these vectors. We will need the ones from the previous example in the next section!
At this point, we have determined which rows in the data set belong to each fold, but we have not actually created the separate data sets defined by each fold. In other words, we have a vector indicating which row goes in each fold, but we have not created separate data sets for each fold yet. Let’s do that now.
Example: Creating the separate data sets
Let’s go back to our smaller data set of 20 rows. We have already created the vector foldsSmall, which indicates the fold each row is assigned to. Now, let’s actually figure out which rows are in each fold.
We are going to start with fold 1. Which rows in penguinsSmall are in fold 1?
which(foldsSmall == 1)
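To see what which is doing, here is a sketch with a hand-written vector of fold labels (toyFolds is hypothetical and was not produced by sample):

```r
# Hand-written fold labels for 6 rows (hypothetical)
toyFolds <- c(2, 1, 3, 1, 2, 3)

# The comparison gives TRUE or FALSE for each position
toyFolds == 1          # FALSE TRUE FALSE TRUE FALSE FALSE

# which() converts the TRUE values into position numbers
rowsInFold1 <- which(toyFolds == 1)
rowsInFold1            # 2 4
```

The output of which is a vector of row numbers, not the rows themselves.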
Question 7
Which rows in the penguinsSmall data set are in fold 1?
Question 8
Which rows in the penguinsSmall data set are in fold 2?
Now that we know which rows are in a fold, we want to actually create a data set that contains those rows. In other words, we want to actually create the different data sets.
To create fold 1, we take all the rows from penguinsSmall that are in fold 1 and pull them from the larger data set.
fold1 <- penguinsSmall[which(foldsSmall == 1), ]
fold1
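The same row numbers can also be used to keep everything except a fold. The sketch below uses a made-up data frame (toyData and toyFolds are hypothetical) to show positive and negative row indexing side by side:

```r
# A toy data set with 6 rows (made up for illustration)
toyData <- data.frame(id = 1:6, weight = c(10, 12, 9, 14, 11, 13))
toyFolds <- c(2, 1, 3, 1, 2, 3)

# Row numbers of the rows assigned to fold 1
inFold1 <- which(toyFolds == 1)

# Positive row numbers keep those rows (the fold itself)
toyFold1 <- toyData[inFold1, ]

# Negative row numbers drop those rows (everything except the fold)
toyRest <- toyData[-inFold1, ]

toyFold1
toyRest
```

Notice that the two pieces never share a row, and together they contain every row of toyData.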
Question 9
Create fold 2 (meaning show the fold 2 data set) for penguinsSmall. Check to make sure the row numbers match what you have in Question 8!
Okay, so now we can actually create the folds we need! Let’s go back to our larger data set.
Back to the Penguin Data
Let’s use the \(n=333\) rows in the penguins data set using 10-fold CV.
Question 10
Create fold 10 (meaning show the fold 10 data set) for penguins.
10-fold CV: Running the Loop
Now that we know how to create the folds, let’s move on to the process of running the loop. We need to
- Set f = 1
- Set fold f as test data and the rest of the folds as training data
- Run 5-nearest neighbors to predict penguin sex.
- Store the predictions from 5-nearest neighbors.
- And repeat!
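The steps above can be sketched on simulated data. Everything in this block (toyData, its columns x1, x2, and group, the seed, and the choice of 5 folds) is an assumption made for illustration. It is not the penguins analysis, and it is only one possible way to arrange the loop:

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed for a reproducible toy example

# Simulated data (hypothetical): 20 rows, 2 numeric features, 2 groups
toyData <- data.frame(
  x1 = c(rnorm(10, mean = 0), rnorm(10, mean = 3)),
  x2 = c(rnorm(10, mean = 0), rnorm(10, mean = 3)),
  group = rep(c("a", "b"), each = 10)
)

# Assign each row to one of 5 folds, 4 rows per fold
toyFolds <- sample(rep(1:5, 4), 20, replace = FALSE)

# Storage for one prediction per row
toyStorage <- data.frame(YHat = rep(NA, 20))

for (f in 1:5) {
  # Rows in fold f are the test data; all other rows are training data
  testRows <- which(toyFolds == f)
  train <- toyData[-testRows, ]
  test <- toyData[testRows, ]

  # Predict the group of each test row from its 5 nearest training rows
  preds <- knn(train = train[, c("x1", "x2")],
               test = test[, c("x1", "x2")],
               cl = factor(train$group), k = 5)

  # Store the predictions in the positions of the test rows
  toyStorage$YHat[testRows] <- as.character(preds)
}

# Every row has been a test row exactly once, so no NA values remain
sum(is.na(toyStorage$YHat))
```

Each pass through the loop fills in only the rows of the current fold, so the storage fills up gradually as f moves from 1 to 5.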
Doing all of this requires building a for loop. To do this, let’s start with our smaller data set penguinsSmall, and then we will move back to the larger data set penguins.
Example: Inside the loop
The purpose of our for loop is to create the vector \(\hat{Y}\) of predictions for penguin sex on the penguinsSmall data set. This means that the first thing we need to do is create a data frame to store that \(\hat{Y}\) in.
storageSmall <- data.frame("YHat" = rep(NA,20))
Question 11
What are the dimensions of storageSmall? (This means how many rows and how many columns.)
We also need to load the library we will need for 5-nearest neighbors.
library(class)
Once this is done, we can start thinking about the loop. Whenever you are writing for loops with a lot of steps, it helps to write the steps that will go inside the loop before you try to build the loop structure. This means that we:
- Set f = 1
f <- 1
- Set fold 1 as test data and the rest of the folds as training data
- Run 5-nearest neighbors to predict penguin sex for fold 1.
- Store the predictions.
Question 12
Code and run Steps 1 to 4 above. Recall that our goal for today is to predict Y = the sex of the penguin based on the flipper length and bill length of that penguin. Print out the storageSmall data frame when you have completed the steps. Which rows are currently filled in? Does it make sense that these rows are filled in? Explain.
NOTE: When you store the predicted penguin sex, use as.character(Kpreds). Why? Because this time our response variable is recorded as words rather than numbers.
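The sketch below illustrates what can go wrong without as.character. It uses hypothetical predictions (toyPreds) stored as a factor, which is the type of object that knn returns, and two made-up storage data frames:

```r
# Hypothetical predictions stored as a factor
toyPreds <- factor(c("male", "female", "male"))

# Two empty storage data frames with 5 rows each
storageCodes <- data.frame(YHat = rep(NA, 5))
storageWords <- data.frame(YHat = rep(NA, 5))

# Storing the factor directly keeps only its internal integer codes
# (levels are sorted alphabetically, so female = 1 and male = 2)
storageCodes$YHat[c(1, 3, 5)] <- toyPreds

# Converting to character first keeps the words themselves
storageWords$YHat[c(1, 3, 5)] <- as.character(toyPreds)

storageCodes$YHat   # 2 NA 1 NA 2
storageWords$YHat   # "male" NA "female" NA "male"
```

The integer codes are easy to misread later, so storing the words makes the final predictions much easier to check.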
Question 13
Repeat this process but now move on to fold 2. Print out the storageSmall data frame when you have completed the steps. Which rows are currently filled in? Does it make sense that these rows are filled in? Explain.
Building the inside first makes it easier to catch any errors. This in essence shows us what each part of the loop will do. Once it does what we want, we can write the loop!
Question 14
Create a for loop to run 5-fold CV on the small data set. Print out the storageSmall data frame when you have completed the loop.
Back to the Penguin Data
Let’s use the \(n=333\) rows in the penguins data set using 10-fold CV.
Question 15
Create a for loop to run 10-fold CV on the penguins data set. Create a confusion matrix when you finish the loop.
Question 16
What is the geometric mean of sensitivity (Y = 1 = male) and specificity (Y = 0 = female) for 10-fold CV with 5 nearest neighbors? Show your work.
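As a reminder of how these quantities are computed, here is a sketch with made-up labels. The counts below are invented for illustration and are not results from the penguins data:

```r
# Made-up true and predicted labels (not results from the penguins data)
truth <- c(rep("male", 10), rep("female", 10))
predicted <- c(rep("male", 8), rep("female", 2),   # 8 of 10 males correct
               rep("female", 9), rep("male", 1))   # 9 of 10 females correct

# Confusion matrix: rows are predictions, columns are the truth
confusion <- table(Predicted = predicted, Actual = truth)
confusion

# Sensitivity: proportion of actual males predicted as male
sensitivity <- confusion["male", "male"] / sum(confusion[, "male"])

# Specificity: proportion of actual females predicted as female
specificity <- confusion["female", "female"] / sum(confusion[, "female"])

# Geometric mean: square root of sensitivity times specificity
geoMean <- sqrt(sensitivity * specificity)
geoMean
```

With these invented counts, the sensitivity is 0.8 and the specificity is 0.9, so the geometric mean is the square root of 0.72.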
Turning in your assignment
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 September 14.
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.