STA 363 Lab 4

Complete all Questions and submit your final PDF or HTML file under Assignments in Canvas.

The Goal

We have been learning a lot of code in our quest to explore cross validation techniques. Today, we are going to dig more deeply into the code behind 10-fold cross validation to make sure we understand the steps. This will help when you work on your first project!

The Data

We are going to go back to the data from our very first lab, which involves penguins. To load the data, put the following two lines of code inside a chunk and press play:

library(palmerpenguins)
data("penguins")

As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data, so we will begin by removing the rows with missing data.

penguins <- na.omit(penguins)

Our goal for today is to predict Y = the sex of the penguin based on the flipper length and bill length of that penguin using 5-nearest neighbors. As we are provided with only one data set, we will need to determine a way to assess the predictive accuracy of any technique we choose. We will do this today using 10-fold CV.

10-fold CV: Creating Folds

10-fold CV involves dividing our original data set into 10 different data sets, called folds. Each fold will contain roughly the same number of rows.
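As a rough illustration of what "roughly the same number of rows" means, the sketch below works through the arithmetic for a made-up data set. The values n = 47 and k = 10 are hypothetical and chosen only for illustration; they are not the penguins data.

```r
# Hypothetical sizes, for illustration only (not the penguins data)
n <- 47   # number of rows
k <- 10   # number of folds

base_size <- n %/% k   # rows every fold gets at minimum (integer division)
extra     <- n %% k    # number of folds that need one extra row (remainder)

# Fold sizes: `extra` folds of size base_size + 1, the rest of size base_size
fold_sizes <- c(rep(base_size + 1, extra), rep(base_size, k - extra))
fold_sizes

# The fold sizes always add back up to the number of rows
sum(fold_sizes) == n
```

When the number of rows is not a multiple of the number of folds, some folds end up one row larger than the others.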

Question 1

How many rows should be in each fold for our penguins data? Will all the folds have exactly the same number of rows? Explain your reasoning.

Once we have determined how many rows need to go into each fold, we need to actually assign each row to a fold. This involves sampling without replacement. Before we try this on the real data set, let’s try a smaller example.

Example: Assigning Folds

Let’s create a smaller data set to work with to illustrate the code.

Question 2

Create a data set called penguinsSmall that contains the first 20 rows in the penguins data set. If we use 5-fold CV with this smaller data set, how many rows belong in each fold?

Assigning rows to folds is like assigning people to tables as they enter a room. We have 20 people who will enter a room, and 5 tables for them to sit at. We want the same number of people at each table. To achieve this, we fill a box with 20 slips of paper. Each paper contains a number: 1, 2, 3, 4, or 5. Each person chooses a slip of paper, and the number on the paper tells the person which table to sit at.

Question 3

For our 5-fold CV with penguinsSmall, how many slips of paper in this box need to contain the number 4?

Now, in R we aren’t actually creating slips of paper, but that is the idea. Instead of a box, we create a vector to hold our table (fold) numbers, using the following:

papers <- rep(1:5, 4)

The command rep in R means “repeat”. So, this command means “Repeat the numbers 1 to 5 four times.”

In this code, we create an object called papers. This vector holds the numbers 1 through 5 each repeated 4 times.
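To see how rep behaves on a smaller scale, here is a minimal sketch using made-up sizes. The length.out argument is part of base R's rep but is not used elsewhere in this lab; it is shown here only as one possible way to handle a row count that does not divide evenly by the number of folds.

```r
# Repeat the whole sequence 1, 2, 3 two times
rep(1:3, 2)                          # 1 2 3 1 2 3

# Hypothetical case where rows do not divide evenly: 7 rows, 3 folds
slips <- rep(1:3, length.out = 7)    # keep repeating 1:3 until we have 7 slips
slips                                # 1 2 3 1 2 3 1

# Count how many slips carry each fold number
table(slips)
```

Note that rep(1:3, 2) repeats the whole sequence, so the fold numbers cycle rather than appearing in blocks.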

Question 4

Suppose we have a data set with 500 rows, and we want to use 5-fold CV. What code would you need to create the equivalent of the papers vector for this new data set?

Now, we have created our box full of papers. The next step is to have each person (each row) draw a slip of paper (a fold number) as they enter the room. To do this, we will use sampling without replacement.

set.seed(363663)
foldsSmall <- sample(papers, 20, replace = FALSE)

Recall that with the sample command, the command structure is sample(What to Sample From, How Many to Sample, Do we sample with or without replacement).
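The sketch below illustrates why sampling without replacement is the right tool here, using a small made-up box of six slips. The names slips and draws and the seed value are assumptions chosen for this illustration.

```r
set.seed(1)  # arbitrary seed so the shuffle is reproducible

slips <- rep(1:3, 2)   # a box with six slips: 1 2 3 1 2 3

# Draw all six slips, one at a time, without putting any back
draws <- sample(slips, length(slips), replace = FALSE)
draws

# Every slip is drawn exactly once, so the number of slips
# per fold number is the same before and after the shuffle
table(slips)
table(draws)
```

With replace = TRUE, the same slip could be drawn more than once, and the folds would no longer be guaranteed to have equal sizes.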

The vector foldsSmall that we have just created indicates which of the 5 folds each row in our 20-row data set belongs to. The first number in foldsSmall tells us the fold that the first row of the data is assigned to, and so on.

Question 5

Which fold is the 5th row in the data set assigned to? What about the 20th row?

Back to the Penguin Data

Okay, now that we have seen an example of how to assign the rows in a data set to folds, let’s try it with the \(n=333\) rows in the penguins data set using 10-fold CV.

Question 6

Use R to assign the 333 rows to 10 folds. Use the same seed as we did in the previous section (363663). Which fold is the 1st row in the penguins data set assigned to? What about the second row?

Hint: Make sure you change the names of papers and foldsSmall so you don’t replace these vectors with new ones. We will need the ones from the previous example in the next section!

At this point, we have a vector indicating which fold each row belongs to, but we have not actually created the separate data sets defined by each fold. Let’s do that now.

Example: Creating the separate data sets

Let’s go back to our smaller data set of 20 rows. We have already created the vector foldsSmall, which indicates the fold each row is assigned to. Now, let’s actually figure out which rows are in each fold.

We are going to start with fold 1. Which rows in penguinsSmall are in fold 1?

which(foldsSmall == 1)
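To unpack what which is doing, here is a minimal sketch with a made-up vector of fold assignments. The vector folds below is hypothetical and is not the foldsSmall vector from this lab.

```r
# Hypothetical fold assignments for six rows
folds <- c(2, 1, 3, 1, 2, 3)

# The comparison gives one TRUE/FALSE per row
folds == 1          # FALSE TRUE FALSE TRUE FALSE FALSE

# which() returns the positions of the TRUE values
which(folds == 1)   # 2 4
```

In this toy example, rows 2 and 4 are the rows assigned to fold 1.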

Question 7

Which rows in the penguinsSmall data set are in fold 1?

Question 8

Which rows in the penguinsSmall data set are in fold 2?

Now that we know which rows are in a fold, we want to actually create a data set that contains those rows.

Creating fold 1 involves taking all the rows from penguinsSmall that are in fold 1 and pulling them from the larger data set.

fold1 <- penguinsSmall[which(foldsSmall==1), ]
fold1
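The same row-indexing idea can be sketched on a made-up data frame. The names toy, folds, in_fold1, fold1_rows, and other_rows are assumptions for this illustration. The sketch also shows a negative index, which keeps every row except the listed ones.

```r
# Toy data frame and hypothetical fold assignments (one per row)
toy   <- data.frame(id = 1:6, x = c(10, 20, 30, 40, 50, 60))
folds <- c(2, 1, 3, 1, 2, 3)

in_fold1 <- which(folds == 1)    # row numbers assigned to fold 1

fold1_rows <- toy[in_fold1, ]    # only the rows in fold 1
other_rows <- toy[-in_fold1, ]   # every row NOT in fold 1

fold1_rows
other_rows
```

The comma inside the square brackets matters: the part before the comma selects rows, and leaving the part after it blank keeps all columns.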

Question 9

Create fold 2 (meaning show the fold 2 data set) for penguinsSmall. Check to make sure the row numbers match what you have in Question 8!

Okay, so now we can actually create the folds we need! Let’s go back to our larger data set.

Back to the Penguin Data

Let’s return to the \(n=333\) rows in the penguins data set and use 10-fold CV.

Question 10

Create fold 10 (meaning show the fold 10 data set) for penguins.

10-fold CV: Running the Loop

Now that we know how to create the folds, let’s move on to the process of running the loop. We need to:

1. Set f = 1.
2. Set fold f as test data and the rest of the folds as training data.
3. Run 5-nearest neighbors to predict penguin sex.
4. Store the predictions from 5-nearest neighbors.
5. Repeat Steps 2-4 for the next fold until every fold has been used as test data.

Doing all of this requires building a for loop. To do this, let’s start with our smaller data set penguinsSmall, and then we will move back to the larger data set penguins.

Example: Inside the loop

The purpose of our for loop is to create the vector \(\hat{Y}\) of predictions for penguin sex on the penguinsSmall data set. This means that the first thing we need to do is create a data frame to store that \(\hat{Y}\) in.

storageSmall <- data.frame("YHat" = rep(NA,20))

Question 11

What are the dimensions of storageSmall? (This means how many rows and how many columns).

We also need to load the library we will need for 5-nearest neighbors.

library(class)

Once this is done, we can start thinking about the loop. Whenever you are writing for loops with a lot of steps, it helps to write the steps that will go inside the loop before you try to build the loop structure. This means that we:

1. Set f = 1

f <- 1

2. Set fold 1 as test data and the rest of the folds as training data
3. Run 5-nearest neighbors to predict penguin sex for fold 1.
4. Store the predictions.
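As one possible illustration of these steps, the sketch below runs a single pass on simulated toy data rather than on the penguins. Everything about the data is made up: the names toy, x1, x2, group, folds, and storage, the seed, 3 folds, and 3 neighbors are assumptions for this example only. The knn function and its train, test, cl, and k arguments come from the class package.

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed

# Toy data: 12 rows, two numeric predictors, a two-level response
toy <- data.frame(
  x1    = c(rnorm(6, mean = 0), rnorm(6, mean = 3)),
  x2    = c(rnorm(6, mean = 0), rnorm(6, mean = 3)),
  group = factor(rep(c("a", "b"), each = 6))
)

# Assign the 12 rows to 3 folds of 4 rows each
folds <- sample(rep(1:3, 4), 12, replace = FALSE)

# Storage for the predictions, one slot per row
storage <- data.frame(YHat = rep(NA, 12))

f <- 1                          # Step 1: choose the fold
infold <- which(folds == f)     # Step 2: rows in fold f are the test data
test   <- toy[infold, ]
train  <- toy[-infold, ]        #         all other rows are training data

# Step 3: 3-nearest neighbors (k = 3 is an arbitrary choice for this toy data)
preds <- knn(train = train[, c("x1", "x2")],
             test  = test[, c("x1", "x2")],
             cl    = train$group,
             k     = 3)

# Step 4: store the predictions in the rows that belong to fold f
storage$YHat[infold] <- as.character(preds)
storage
```

After this single pass, only the rows assigned to fold 1 hold a prediction; the remaining rows are still NA.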

Question 12

Code and run Steps 1-4 above. Recall that our goal for today is to predict Y = the sex of the penguin based on the flipper length and bill length of that penguin. Print out the storageSmall data frame when you have completed the steps. Which rows are currently filled in? Does it make sense that these rows are filled in? Explain.

NOTE: When you store the predicted penguin sex, use as.character(Kpreds). Why? Because this time our response variable is recorded as words rather than numbers.
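The small sketch below shows what can go wrong without the conversion. The vector preds here is a made-up stand-in for factor-valued predictions, not output from a real model, and codes and labels are names chosen for this illustration.

```r
# Made-up predictions stored as a factor (levels sort as female = 1, male = 2)
preds <- factor(c("male", "female"))

codes <- rep(NA, 4)
codes[1:2] <- preds                  # factor assigned directly
codes                                # 2 1 NA NA: the integer codes, not the words

labels <- rep(NA, 4)
labels[1:2] <- as.character(preds)   # converted to words first
labels                               # "male" "female" NA NA
```

Converting with as.character before storing keeps the readable category names, which makes the later confusion matrix much easier to interpret.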

Question 13

Repeat this process, but now move on to fold 2. Print out the storageSmall data frame when you have completed the steps. Which rows are currently filled in? Does it make sense that these rows are filled in? Explain.

Building the inside first makes it easier to catch any errors. This in essence shows us what each part of the loop will do. Once it does what we want, we can write the loop!
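To show how the tested inside steps can be wrapped in a loop, here is a rough sketch on simulated toy data, not the penguins. The names toy, x1, x2, group, folds, storage, n, and n_folds, the seed, and the choice of 3 folds and 3 neighbors are all assumptions made for this illustration.

```r
library(class)  # provides knn()

set.seed(2)  # arbitrary seed

n       <- 12   # rows in the toy data
n_folds <- 3    # number of folds

# Toy data: two numeric predictors and a two-level response
toy <- data.frame(
  x1    = c(rnorm(6, mean = 0), rnorm(6, mean = 3)),
  x2    = c(rnorm(6, mean = 0), rnorm(6, mean = 3)),
  group = factor(rep(c("a", "b"), each = 6))
)

# Assign rows to folds and set up storage before the loop starts
folds   <- sample(rep(1:n_folds, n / n_folds), n, replace = FALSE)
storage <- data.frame(YHat = rep(NA, n))

for (f in 1:n_folds) {
  infold <- which(folds == f)   # rows in the current fold
  test   <- toy[infold, ]       # current fold is the test data
  train  <- toy[-infold, ]      # all other folds are the training data

  preds <- knn(train = train[, c("x1", "x2")],
               test  = test[, c("x1", "x2")],
               cl    = train$group,
               k     = 3)

  # Fill in only the rows that belong to the current fold
  storage$YHat[infold] <- as.character(preds)
}

# Every row has now been predicted exactly once
table(Predicted = storage$YHat, Actual = toy$group)
```

The only thing that changes between passes is the value of f, which is why it is the loop variable. The fold assignments and the storage data frame are created once, outside the loop, so they are not overwritten on each pass.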

Question 14

Create a for loop to run 5-fold CV on the small data set. Print out the storageSmall data frame when you have completed the loop.

Back to the Penguin Data

Let’s return to the \(n=333\) rows in the penguins data set and use 10-fold CV.

Question 15

Create a for loop to run 10-fold CV on the penguins data set. Create a confusion matrix when you finish the loop.

Question 16

What is the geometric mean of sensitivity (Y = 1 = male) and specificity (Y = 0 = female) for 10-fold CV with 5 nearest neighbors? Show your work.
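As a reminder of the arithmetic, the sketch below computes the geometric mean from a made-up confusion matrix. The four counts are invented purely for illustration and are not results from the penguins data; the names TP, FN, TN, FP, and gmean are chosen for this example.

```r
# Made-up counts, for illustration only (not results from the penguins data)
TP <- 40   # actual male,   predicted male
FN <- 10   # actual male,   predicted female
TN <- 45   # actual female, predicted female
FP <- 5    # actual female, predicted male

sensitivity <- TP / (TP + FN)   # 40 / 50 = 0.8
specificity <- TN / (TN + FP)   # 45 / 50 = 0.9

# Geometric mean: square root of the product
gmean <- sqrt(sensitivity * specificity)   # sqrt(0.72), about 0.849
gmean
```

The geometric mean is pulled toward the smaller of the two values, so a method cannot score well by doing well on only one class.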

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or HTML document to Canvas. No other formats will be accepted. Look through the final document to make sure everything has knit correctly.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 September 14.

The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.