STA 363 Lab 4

Complete all Questions and submit your final PDF or html under Assignments in Canvas.

The Goal

We have been learning a lot of code in our quest to explore cross validation techniques. Today, we are going to dig more deeply into the code behind 10-fold CV.

The Data

We are going to stick with our penguins data set from last lab. To load the data, put the following lines of code inside a chunk and press play:

library(palmerpenguins)
data("penguins")
penguins <- na.omit(penguins)
# Remove the feature year: we don't want to use it today
penguins <- data.frame(penguins)[,-8]

Our goal for today is to predict \(Y\) = the body mass of the penguin. As we are provided with only one data set, we will need to determine a way to assess the predictive accuracy of any technique we choose. We will do this today using 10-fold CV. 10-fold CV involves dividing our original data set into 10 different subsets called folds. Each fold will contain roughly the same number of rows, and each row from the original data set will be assigned to exactly one fold.

In class, we discussed the folds in terms of tables at a restaurant. We have 10 folds (so 10 tables) and now we need to think about how many penguins need to be assigned to each fold (table).

Question 1

How many rows should be in each fold for our penguins data? Will all the folds have exactly the same number of rows? Explain your reasoning.

Once we have determined how many rows (penguins) need to go in to each fold (table), we need to complete the process of actually assigning each row to a fold (penguin to a table). This involves sampling without replacement. Before we try this on the real data set, let’s try a smaller example.

Small Data Example

Let’s create a smaller data set to work with to illustrate the code.

Question 2

Create a data set called penguinsSmall that contains the first 20 rows in the penguin data set. If we use 5-fold CV with this smaller data set, how many rows belong in each fold?

Assigning Fold Numbers

Assigning rows to folds is like assigning people to tables as they enter a room. We have 20 people who will enter a room, and 5 tables for them to sit at. We want the same number of people at each table. To achieve this, we fill a box with 20 tickets. Each ticket contains a number: 1, 2, 3, 4, or 5. Each person chooses a ticket, and the number on the ticket tells the person which table to sit at.

Question 3

For our 5-fold CV with penguinsSmall, how many tickets in this box need to contain the number 4?

Now, in R we aren’t actually creating tickets, but that is the idea. To do this in R, we are going to create a vector rather than a box to hold our tickets. To do this, use the following:

tickets <- rep(1:5, 4)

The command rep in R means “repeat”. So, this command means “Repeat the numbers 1 to 5 four times”. Take a look at what is in the vector.

tickets

Question 4

Suppose we have a data set with 500 rows, and we want to use 5-fold CV. What code would we need to create the equivalent of the tickets vector for this new data set? Call your vector tickets_Q4.

tickets_Q4 <-

Now that we have we have created our tickets, the next step is to have each person (each row) draw a ticket (a fold number) as they enter the restaurant. To do this, we will use sampling without replacement.

# Set the random seed
set.seed(363663)

# Draw tickets from the box
foldsSmall <- sample(tickets, 20, replace = FALSE)

Recall that with the sample command, the command structure is sample(What to Sample From, How Many to Sample, Do we sample with or without replacement). We are sampling from the box of tickets, 20 people are drawing out a ticket, and they do not put their ticket back once they draw it.

The vector foldsSmall that we have just created indicates which row in our 20 row data set belongs to each of the 5-folds. The first number in foldsSmall tells us the fold that the first row of the data is assigned to, and so on.

Question 5

Which fold is the 5th row in the data set assigned to? What about the 20th row?

At this point, we have determined which rows in the penguinsSmall data set belong to each fold, but we have not actually created the separate data sets defined by each fold. In other words, each penguin is holding a ticket so they know which table number they need to go to, but we have not actually moved the penguins to their tables. Let’s do that now.

Moving rows into folds (Creating the separate data sets)

We have already created the vector foldsSmall which indicates the folds each row is assigned to. Now, let’s actually figure out which rows are in each fold.

We are going to start with fold 1. Which rows in penguinsSmall are in fold 1? To determine this, we use the following code:

which(foldsSmall == 1)

In R, == looks to see if two things match. So, 2 ==3 will return FALSE and 2==2 will return TRUE.

Question 6

Which rows in the penguinsSmall data set are in fold 1?

Question 7

Which rows in the penguinsSmall data set are in fold 2?

Now that we know which rows are in a fold, we want to actually create a data set that contains those rows. To create fold 1, this involves taking all the rows from penguinsSmall that are in fold 1 and storing them as a data set called fold1.

fold1 <- penguinsSmall[which(foldsSmall==1), ]

Question 8

Create fold 2 (meaning show the fold 2 data set) for penguinsSmall and print out the fold. Check to make sure the row numbers match what you have in Question 7!

Okay, so now we can actually create the folds we need! Let’s go back our larger data set.

Back to the Penguin Data

Okay, now that we have seen an example of how to assign the rows in a data set to folds, let’s try it with all \(n=333\) rows in the penguins data set using 10-fold CV.

Question 9

Use R to assign all 333 rows in the penguins data set to 10 folds, and store them in a vector called folds. Use the same seed as we did in the previous section (363663). Which fold is the 1st row in the penguins data set assigned to? What about the 2nd row?

Question 10

Create fold 10 (meaning show the fold 10 data set) for penguins.

10-fold CV: Running the Loop

Now that we know how to create the folds, let’s move on to the process of running the loop. We need to:

Set f = 1
Set fold f as validation data and the rest of the folds as training data
Build a linear regression model to predict body mass using all other columns as features.
Store the predictions for the rows in the validation data.
And repeat!

This should sound like a for loop, because it is one! We also need one step before we run the loop, which is creating our storage space:

# Create storage space 
predictions <- data.frame("body_mass_g" = rep(NA,n))

One of the tricks to making a for loop is to write the inside first. In other words, write the code that would give the results for f = 1. Once we have done that, it is easier to figure out what parts of the code need to change with f and create the loop. Let’s try it.

Question 11

Code and run Step 1 - 4 above for f=1. Show your code. Which rows in your predictions are currently filled in? Does it make sense that these rows are filled in? Explain.

Hint: which( is.na(predictions)==FALSE) might help you find the rows that are filled in.

Building the inside first makes it easier to catch any errors. This in essence shows us what each iteration of the loop will do. Once it does what we want, we can write the loop!

Question 12

Create a for loop to run 10-fold CV for the penguins data. State the validation RMSE.

Reminder: You need to put this function in a chunk and press play before you can compute the RMSE.

compute_RMSE <- function(truth, predictions){

  # Part 1
  part1 <- truth - predictions
  
  #Part 2
  part2 <- part1^2
  
  #Part 3: RSS 
  part3 <- sum(part2)
  
  #Part4: MSE
  part4 <- 1/length(predictions)*part3
  
  # Part 5:RMSE
  sqrt(part4)
}

KNN

Now that we have seen how the process words, let’s try another technique to predict body mass: KNN.

Look at Row 1 in the penguins data and Row 53 in the penguins data.

	1	53
species	Adelie	Adelie
island	Torgersen	Biscoe
bill_length_mm	39.1	36.5
bill_depth_mm	18.7	16.6
flipper_length_mm	181	181
body_mass_g	3750	2850
sex	male	female

We want to use the Gower’s Distance to find the distance between these two rows. Since we have two variables that are categorical with more than two levels, our code will actually “see” a slightly different version of the two rows.

	1	53
speciesChinstrap	0.0	0.0
speciesGentoo	0.0	0.0
islandDream	0.0	0.0
islandTorgersen	1.0	0.0
bill_length_mm	39.1	36.5
bill_depth_mm	18.7	16.6
flipper_length_mm	181.0	181.0
sexmale	1.0	0.0

Question 13

Using the data shown above, find the Gower Distance between Row 1 and Row 53.

Note: It is good practice to compute this by hand.

We are going to use 10-fold CV to estimate the test RMSE of 5-NN (5-nearest neighbors). Before running KNN, we need to load the StatMatch library and teach R the function to compute Gower’s Distance.

suppressMessages(library(StatMatch))

knnGower <- function(trainX, testX, trainY, K){
  # Convert the data 
  trainX <- model.matrix(trainY~ ., data = trainX)[,-1]
  holder <- 1:nrow(testX)
  testX <- model.matrix(holder~ ., data = testX)[,-1]
  
  # Find the Gower Distance
  gowerHold <- StatMatch::gower.dist( testX, trainX)
  # For each row, find the k smallest
  neighbors <- apply(gowerHold, 1, function(x) which(x %in% sort(x)[1:K]))
  
  if(class(neighbors)[1]=="integer"){
    preds <- trainY[neighbors]
  }
  
  # Take the mean to get the prediction
  if(class(neighbors)[1]=="matrix"){
    preds <- apply(neighbors, 2,function(x) mean(trainY[x]))
  }
  
  if(class(neighbors)[1]=="list"){
    preds <- lapply(neighbors, function(x) mean(trainY[x]))
    preds <- unlist(preds)
  }
  # Return the predictions
  unlist(preds)
}

Recall that to use this function, the structure is:

knnGower( trainingFeatures, testFeatures, trainingResponse, K = numberOfNeighbors)

We want to use 10-fold CV to assess the predictive accuracy of this approach. Our steps for the 10-fold CV with KNN are:

Set f = 1
Set fold f as validation data and the rest of the folds as training data
Use 5-NN to make predictions for the rows in the validation data.
Store the predictions
And repeat!

Question 14

Using 5-NN and \(k = 10\) folds (yes, I know the notation is annoying, but we’re stuck with it!), write a for loop to use 5-NN to make predictions for body mass. Show your code.

Question 15

State the validation RMSE for 5-NN. Is it larger or smaller than what you got using linear regression?

Considering K

In the work we have done so far, we have used 5-NN. However, is that the best choice for the number of neighbors? Could we end up with a smaller validation RMSE if we choose a different number of neighbors \(K\)?

To determine this, we actually need to try out different options of \(K\). This means not only do we need a for loop that loops over the folds, we also need one that allows us to explore different choices of \(K\). The code structure we need for this is called a Nested for-loop. You will work with this for your Project 1.

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.

This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 September 22.

The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218. .