STA 363 Lab 4

Complete all Questions and submit your final PDF or HTML under Assignments in Canvas.

The Goal

We have been learning a lot of code in our quest to explore cross validation techniques. Today, we are going to dig more deeply into the code behind 10-fold CV.

The Data

Our data for today is the cell phone price data from Lab 2. Recall that the goal is to predict $Y$ = cell phone prices. In addition to price, we have 14 features:

battery_power: Total energy a battery can store at one time, measured in mAh
clock_speed: speed at which microprocessor executes instructions
fc: Front Camera mega pixels
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of processor
pc: Primary Camera mega pixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in Mega Bytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: longest time that a single battery charge will last

To load both the test and training data sets, put the code below into a chunk and press play.

test <- read.csv("https://www.dropbox.com/scl/fi/y1hjakf9e3pkkohkot4sf/test_phones.csv?rlkey=7ujlcxnuvu0lemcoit39t6mcd&st=awcl33xl&dl=1")
train  <- read.csv("https://www.dropbox.com/scl/fi/8iitmby2trytf65pvmk96/train_phones.csv?rlkey=y5rhlbd26d6b20mu56oba4xg7&st=50a35x9h&dl=1")

We will also need the caret and ggplot2 libraries today.

suppressMessages( library(ggplot2) )
suppressMessages( library(caret) )

We are going to use KNN to predict cell phone prices, as we saw in Lab 2 that KNN predicted better than regression for these data. However, we need to choose $K$ for KNN! In Lab 2, we only explored a few choices of $K$. Today, we are going to explore all options $K = 2, 3, 4, \dots, 50$.

We have test data today, so it is tempting to use the test data to choose $K$. However, we do not want to use test data to train or tune models or algorithms. Instead, we will use 10-fold CV to create validation data that we can use to choose $K$. With that choice of $K$, we can then fit KNN to our train data and make predictions on our test data to see how well that choice of $K$ would work in practice!

This means validation data has two uses. Validation data can be used to estimate test metrics like the test RMSE when there is no test data present. When test data is present, validation data can be used to choose tuning parameters (like $K$ in KNN) that are needed for a model or algorithm to run.

10-fold CV: Small Data Example

Before we get into actually using 10-fold CV for choosing $K$, we are going to walk through the steps needed to run 10-fold CV.

10-fold CV involves dividing our train data set into 10 different subsets called folds. Each fold will contain roughly the same number of rows, and each row from the original data set will be assigned to exactly one fold.
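To make "roughly the same number of rows" concrete, here is a small sketch of how fold sizes work out when the rows do not divide evenly. The values `n_toy` and `k_toy` are made up for illustration and are not the lab data.

```r
# Hypothetical sizes, chosen only for illustration (not the lab data)
n_toy <- 23   # number of rows
k_toy <- 5    # number of folds

# Every fold gets at least this many rows
base_size <- n_toy %/% k_toy

# This many folds need one extra row to use up all the rows
leftover <- n_toy %% k_toy

# Fold sizes: the first `leftover` folds get one extra row
fold_sizes <- rep(base_size, k_toy) + (seq_len(k_toy) <= leftover)
fold_sizes
```

When the number of rows divides evenly by the number of folds, `leftover` is 0 and every fold has the same size.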

Question 1

How many rows should be in each fold for our train data? Will all the folds have exactly the same number of rows? Explain your reasoning.

Once we have determined how many rows (cell phones) need to go into each fold, we need to complete the process of determining which rows will be assigned to each fold. Doing this involves sampling without replacement.

Before we try this on the real data set, let’s create a smaller data set to work with to illustrate the code.

Question 2

Create a data set called trainSmall that contains the first 20 rows in the train data set. If we use 5-fold CV with this smaller data set, how many rows belong in each fold?

Assigning Fold Numbers

Assigning rows to folds is like assigning people to tables as they enter a room. We have 20 people who will enter a room, and 5 tables for them to sit at. We want the same number of people at each table. To achieve this, we fill a box with 20 tickets. Each ticket contains a number: 1, 2, 3, 4, or 5. Each person chooses a ticket, and the number on the ticket tells the person which table to sit at.

Question 3

For our 5-fold CV with trainSmall, how many tickets in this box need to contain the number 4?

Now, in R we aren’t actually creating tickets, but that is the idea. We are actually going to create a vector rather than a box to hold our tickets. To do this, use the following:

tickets <- rep(1:5, 4)

The command rep in R means “repeat”. So, this command means “Repeat the numbers 1 to 5 four times”. Take a look at what is in the vector.

tickets

##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
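If it helps to see rep on something even smaller, the sketch below uses a made-up box of 6 tickets for 3 tables. The name `toy_tickets` is hypothetical and is only for illustration.

```r
# Repeat the whole sequence 1, 2, 3 two times
rep(1:3, 2)

# Repeat each number two times before moving on to the next
rep(1:3, each = 2)

# Count how many tickets carry each number
toy_tickets <- rep(1:3, 2)
table(toy_tickets)
```

Both versions of rep put the same tickets in the box, just in a different order. Since we shuffle the tickets when we draw them, the starting order does not matter.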

Question 4

Suppose we have a data set with 500 rows, and we want to use 5-fold CV. What code would we need to create the equivalent of the tickets vector for this new data set? Call your vector tickets_Q4.

tickets_Q4 <-

Hint: If you want to show me code, but not run it (which is the case right now since we are working with imaginary data!), change your chunk header from {r} to just r. In other words, delete the {}.

Now that we have created our tickets, the next step is to have each person (each row in trainSmall) draw a ticket (a fold number) as they enter the room. To do this, we will use sampling without replacement.

# Set the random seed
set.seed(363663)

# Draw tickets from the box
foldsSmall <- sample(tickets, 20, replace = FALSE)

Recall that with the sample command, the command structure is sample(What to Sample From, How Many to Sample, Do we sample with or without replacement). We are sampling from the box of tickets, 20 people are drawing out a ticket, and they do not put their ticket back once they draw it.

The vector foldsSmall that we have just created indicates which of the 5 folds each row in our 20-row data set belongs to. The first number in foldsSmall tells us the fold that the first row of the data is assigned to, and so on.
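A small sketch of why sampling without replacement keeps the folds the same size: drawing every ticket exactly once only shuffles the box. The names `toy_tickets` and `toy_folds` and the seed are made up for illustration.

```r
# Hypothetical box of 6 tickets for 3 tables (not the lab data)
toy_tickets <- rep(1:3, 2)

set.seed(1)  # arbitrary seed so the shuffle is reproducible
toy_folds <- sample(toy_tickets, 6, replace = FALSE)

# Same tickets, different order: each number still appears exactly twice
table(toy_folds)
```

With replace = TRUE, a ticket could be drawn more than once, so some tables could end up with more people than others.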

Question 5

Which fold is the 5th row in the data set assigned to? What about the 20th row?

At this point, we have determined which rows in the trainSmall data set belong to each fold, but we have not actually created the separate data sets defined by each fold. In other words, each person is assigned a ticket so they know which table number they need to go to, but the people have not actually been shown to their tables. Let’s do that now.

Moving rows into folds (Creating the separate data sets)

We have already created the vector foldsSmall which indicates the folds each row is assigned to. Now, let’s actually figure out which rows are in each fold.

We are going to start with fold 1. Which rows in trainSmall are in fold 1? To determine this, we use the following code:

which(foldsSmall == 1)

In R, == looks to see if two things match. So, 2 == 3 will return FALSE and 2 == 2 will return TRUE.
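When == is applied to a whole vector, it checks every element, and which reports the positions where the answer is TRUE. Here is a sketch with made-up fold numbers (`toy_folds` is hypothetical, not the lab data):

```r
# Hypothetical fold assignments for 6 rows (not the lab data)
toy_folds <- c(2, 1, 3, 1, 2, 3)

# One TRUE/FALSE answer per row: is this row in fold 1?
toy_folds == 1
# FALSE  TRUE FALSE  TRUE FALSE FALSE

# Positions of the TRUE values, i.e. the row numbers in fold 1
which(toy_folds == 1)
# 2 4
```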

Question 6

Which rows in the trainSmall data set are in fold 1?

Question 7

Which rows in the trainSmall data set are in fold 2?

Now that we know which rows are in a fold, we want to actually create a data set that contains those rows. To create fold 1, this involves taking all the rows from trainSmall that are in fold 1 and storing them as a data set called fold1.

fold1 <- trainSmall[which(foldsSmall==1), ]
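The square brackets work as data[rows, columns]. The sketch below shows the same idea on a made-up 6-row data set; all of the names in it (`toy_data`, `toy_folds`, `in_fold1`, and so on) are hypothetical.

```r
# Hypothetical 6-row data set and fold assignments (not the lab data)
toy_data <- data.frame(id = 1:6, price = c(10, 20, 30, 40, 50, 60))
toy_folds <- c(2, 1, 3, 1, 2, 3)

# Row numbers assigned to fold 1
in_fold1 <- which(toy_folds == 1)

# Keep only those rows; the empty space after the comma keeps every column
toy_fold1 <- toy_data[in_fold1, ]

# A minus sign drops those rows instead, leaving everything not in fold 1
toy_rest <- toy_data[-in_fold1, ]
```

Every row lands in exactly one of `toy_fold1` or `toy_rest`, so the two pieces together give back the full data set.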

Question 8

Create fold 2 (meaning show the fold 2 data set) for trainSmall and print out the fold. Check to make sure the row numbers match what you have in Question 7!

Okay, so now we can actually create the folds we need! Let’s go back to our larger data set.

10-fold CV on the original Cell Phone Data

Okay, now that we have seen an example of how to assign the rows in a data set to folds, let’s try it with all $n=1500$ rows in the train data set using 10-fold CV.

Question 9

Use R to assign all 1500 rows in the train data set to 10 folds, and store these fold numbers in a vector called folds. Use the same seed as we did in the previous section (363663). Which fold is the 1st row in train assigned to? What about the 2nd row?

Note: You do not need to create any data sets at this point, meaning we are not moving rows into folds yet. You just need to assign fold numbers to each row.

Question 10

Create fold 10, and show the first 5 rows in the fold 10 data set for train as the answer to this question. In other words, find all rows assigned to fold 10, move them into a data set, and print out the first 5 rows.

Running the Loop

Okay, now we know how to (1) assign fold numbers to each row and (2) move the rows into smaller data sets called folds based on those fold numbers. At this point, we are ready to run a for loop over the folds to obtain the validation RMSE!

The steps of this process are:

1. Set i = 1.
2. Set fold i as validation data and the rest of the folds as new training data.
3. Use KNN on the new training data to predict price on the validation data using all other columns as features.
4. Store the predictions for the rows in the validation data.
5. Repeat Steps 2 - 4 for i = 2, 3, ..., 10, so that each of the 10 folds is used as validation data once.
6. Compute the RMSE of the predictions versus the truth in the train data set.

Step 4 in the process involves storing our predictions, so we need to build a data frame to store the predictions in before we run the loop:

# Create storage space 
yhat <- data.frame("price" = rep(NA,1500))
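To see how a storage data frame fills in a few rows at a time, here is a sketch with made-up numbers. The name `toy_yhat`, the row numbers, and the predicted values are all hypothetical.

```r
# Hypothetical storage for 6 predictions, all missing to start
toy_yhat <- data.frame("price" = rep(NA, 6))

# Fill in only rows 2 and 4, as if those rows were the validation data
toy_yhat[c(2, 4), "price"] <- c(115, 98)

# Rows that have been filled in so far
which(!is.na(toy_yhat$price))
# 2 4
```

The other rows stay NA until a later step fills them in, which is why any NA left over at the end signals that something was skipped.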

One of the tricks I suggest for writing a for loop is to write the inside first. In other words, write the code that would give the results for i = 1. Once we have done that, it is easier to figure out what parts of the code need to change with i and create the loop. Let’s try it.

Question 11

Code and run Steps 1 - 4 above for i=1, using $K = 5$ neighbors for now (we will change that in a minute). Which rows in your yhat are currently filled in? Does it make sense that these rows are filled in? Explain.

Hint: which( is.na(yhat)==FALSE) might help you find the rows that are filled in.

Building the inside first makes it easier to catch any errors. This in essence shows us what each iteration of the loop will do. Once it does what we want, we can write the loop!

Question 12

Create a for loop to run 10-fold CV for the train data using 5-NN. State the validation RMSE.

Reminder: You need to put this function in a chunk and press play before you can compute the RMSE.

compute_RMSE <- function(truth, predictions){

  # Part 1
  part1 <- truth - predictions
  
  #Part 2
  part2 <- part1^2
  
  #Part 3: RSS 
  part3 <- sum(part2)
  
  #Part4: MSE
  part4 <- 1/length(predictions)*part3
  
  # Part 5:RMSE
  sqrt(part4)
}
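To check your understanding of the five parts, here is the same arithmetic worked by hand on two made-up values. The names `toy_truth` and `toy_predictions` are hypothetical and the numbers are not from the lab data.

```r
# Hypothetical true values and predictions (not the lab data)
toy_truth <- c(10, 20)
toy_predictions <- c(9, 27)

# Same five parts as compute_RMSE, written out one at a time
errors   <- toy_truth - toy_predictions     # 1 and -7
squared  <- errors^2                        # 1 and 49
rss      <- sum(squared)                    # 50
mse      <- rss / length(toy_predictions)   # 25
toy_rmse <- sqrt(mse)                       # 5
toy_rmse
```

Notice that the one large error (7) pulls the RMSE up much more than the small error (1) does, because the errors are squared before they are averaged.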

Considering K

Now, the work we have done so far just found the validation RMSE using 10-fold CV. However, we have test data, so we didn’t really need to create validation data to estimate the test RMSE. What we do need validation data for is to help us choose $K$, and the work we have just done is a big part of the code we need for that process. We are going to (1) use 10-fold CV to create validation data to choose $K$, (2) train KNN with train, and (3) assess our predictive ability with test.

In the work we have done so far, we have used 5-NN. However, is that the best choice for the number of neighbors? Could we end up with a smaller validation RMSE if we choose a different number of neighbors $K$? To determine this, we actually need to try out different options of $K$.

Question 13

Adapt your code in Question 12 to find the validation RMSE for $K = 2$. Would you recommend using $K=2$ or $K=5$ based on the validation RMSE?

Ideally, we want to be able to do what we did in Question 13 (compare choices of $K$) for many options of $K$. Our data set for today is small enough that we can consider a lot of options and see which one works best, so let’s try $K = 2, 3, 4, 5, 6, \dots, 49, 50$.

Question 14

When we want to do something in coding over and over again, changing just one small thing (like the value of $K$), what coding structure helps us to do that?

We already have one for loop to run 10-fold CV. To change $K$, we need to wrap another for loop around that for loop. The code structure we need for this is called a Nested for-loop.

A Nested for-loop looks like this:

# Outer Loop
for( firstindex in SOMETHING){

  # Inner Loop 
  for( secondindex in SOMETHING ELSE){
  
 
  }

}

Example: Let’s suppose for each number 1 - 5, I want to add the numbers 3 - 10. To do this using a nested for-loop we use the following:

# Outer Loop
for( i in 1:5){

  # Inner Loop 
  for( j in 3:10){
    
    print( i + j)
 
  }

}
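The example above prints its results. As a sketch of what the nested loop is doing, the version below counts how many times the inside runs and keeps the sums in a matrix instead. The names `toy_sums` and `n_runs` are made up for illustration.

```r
# Storage: one row per value of i, one column per value of j
toy_sums <- matrix(NA, nrow = 5, ncol = 8)
n_runs <- 0

for (i in 1:5) {
  for (j in 3:10) {
    # j runs from 3 to 10, so j - 2 gives column numbers 1 to 8
    toy_sums[i, j - 2] <- i + j
    n_runs <- n_runs + 1
  }
}

n_runs         # 40: the inner body runs 5 * 8 times
toy_sums[1, ]  # sums for i = 1: 4 5 6 7 8 9 10 11
```

For each single value of i, the inner loop runs all the way through every value of j before i moves on.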

Generally, when I write nested loops I start with the inner loop to make sure it runs. Our goal is to find a validation RMSE for KNN using 10-fold CV with $K = 2, 3, 4, 5, 6, \dots, 49, 50$. This means the inner loop is 10-fold CV, and we already know that code is:

yhat <- data.frame("price" = rep(NA,1500))

for( i in 1:10){
  
  # Step 1 
  infold <- which(folds == i)

  # Step 2 
  newTrain <- train[ -infold,]
  validation  <- train[infold, ]

  # Step 3
  yhat[infold,"price"] <- knnregTrain( newTrain[,-15] , validation[,-15], newTrain[ , 15], k = 5)

}

# Compare the predictions to the true prices (column 15 of train)
compute_RMSE( train[ , 15], yhat$price)
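As a reminder of the general idea behind KNN for a numeric response, the sketch below predicts one new point by averaging the prices of its nearest neighbors. This is a hand-written illustration, not the code inside caret: the helper `knn_predict_one` and the toy data are made up, and caret's own implementation may differ in details such as how ties are handled.

```r
# Hypothetical helper: predict one new point by averaging the responses
# of its k nearest training points (Euclidean distance)
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Row numbers of the k smallest distances
  nearest <- order(dists)[seq_len(k)]
  mean(train_y[nearest])
}

# Made-up training data: one feature and a price for 4 phones
toy_x <- matrix(c(1, 2, 3, 10), ncol = 1)
toy_y <- c(100, 110, 120, 500)

# The two nearest neighbors of 2.2 are the phones with features 2 and 3,
# so the prediction is the average of 110 and 120
toy_pred <- knn_predict_one(toy_x, toy_y, new_x = 2.2, k = 2)
toy_pred
# 115
```

Changing k changes which phones get averaged, which is why the choice of $K$ can change the validation RMSE.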

I then think about the outer loop as wrapping around that inner loop and allowing something (in this case $K$) within the inner loop to change.

# Outer loop 
for(k in SOMETHING){ 

  yhat <- data.frame("price" = rep(NA,1500))
  
  # Inner loop 
  for( i in 1:10){
  
    # Step 1 
    infold <- which(folds == i)

    # Step 2 
    newTrain <- train[ -infold,]
    validation  <- train[infold, ]

    # Step 3
    yhat[infold,"price"] <- knnregTrain( newTrain[,-15] , validation[,-15], newTrain[ , 15], k = SOMETHING ELSE )

  }
  # Close inner loop

  # Compare the predictions to the true prices (column 15 of train)
  compute_RMSE( train[ , 15], yhat$price)
  
} 
# Close outer loop

Question 15

What should the SOMETHING be replaced with in the code above?

Hint: Our goal is to find the validation RMSE for $K = 2, 3, 4, 5, 6, \dots, 49, 50$.

Question 16

What should the SOMETHING ELSE be replaced with in the code above?

Hint: Our goal is to find the validation RMSE for $K = 2, 3, 4, 5, 6, \dots, 49, 50$.

The only trick now is the compute_RMSE part. This part of the code will compute the validation RMSE for each choice of $K$. However, as written now the loop will print the validation RMSE as the code runs. We do not really want 49 values of the RMSE to print out; that would make it really hard to figure out which choice of $K$ gives us the lowest validation RMSE. Instead, we want to store the RMSE each time through the outer loop.

Question 17

Write code to create a data frame called RMSE_storage with 49 rows and 2 columns (called K and RMSE), such that column K is filled in with the numbers 2:50 (our options for $K$) and column RMSE is blank (filled with NA). As an answer to this question, print out RMSE_storage.

Question 18

In the nested for loop, should the code from Question 17 go:

1. Before the outer loop
2. Inside the outer loop but before the inner loop starts
3. Inside the inner loop
4. After the inner loop but inside the outer loop
5. After the outer loop

At this point, we have done essentially everything except adapt the code so that we store the validation RMSE in RMSE_storage rather than printing it out each time the outer loop runs. Let’s try our hand at running the whole loop.

Question 19

Using the code above Question 15 and the answers to Questions 15-18 as guides, write code that, for $K = 2, 3, 4, \dots, 50$, estimates the test RMSE using 10-fold CV and stores the validation RMSE estimates in RMSE_storage. After running the code, print out RMSE_storage as the answer to this question.

Hint: If you get NAs in RMSE_storage, something went wrong. Let Dr. Dalzell know if you need help!

Visualizing Results

At this point, we have the validation RMSE estimate for every choice of $K = 2, 3, \dots, 49, 50$. Now, we need to figure out which validation RMSE is smallest so we can choose $K$.

We can find the smallest validation RMSE by looking through RMSE_storage, but this can be challenging if we try many values of $K$. Instead, let’s try a plot.

# Find the value of K that has the smallest validation RMSE
smallestRMSE <- c(2:50)[which.min(RMSE_storage$RMSE)]

# Plot the validation RMSE for each K, with a dashed line at the best K
ggplot( data.frame(RMSE_storage), aes( x = K, y = RMSE)) +
  geom_point() +
  geom_vline( xintercept = smallestRMSE, col = "blue" , lty = 2 ) +
  labs(caption = paste("Validation RMSE values for Different Tuning Parameters. Smallest RMSE at K = ", smallestRMSE),
       y = "Validation RMSE",
       x = "Tuning Parameter")
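The function which.min returns the position of the smallest value, not the value of $K$ itself, which is why the code looks up $K$ in a second step. Here is a sketch with made-up results; the data frame `toy_storage` and its numbers are hypothetical and are not from the lab data.

```r
# Hypothetical validation results (made-up numbers, not from the lab data)
toy_storage <- data.frame(K = 2:6, RMSE = c(5.1, 4.2, 3.9, 4.0, 4.4))

# which.min gives the row position of the smallest RMSE, not K itself
best_row <- which.min(toy_storage$RMSE)   # 3

# Look up the value of K stored in that row
best_K <- toy_storage$K[best_row]         # 4
```

Because the options for $K$ start at 2 rather than 1, the row position and the value of $K$ are off by one.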

Question 20

Show the plot and state which choice of $K$ you would recommend.

Question 21

There is a default choice of $K$ we use on large data sets when running a for loop isn’t feasible. Would that choice of $K$ be a good choice here? Explain.

Back to the test data

Okay, so we used 10-fold CV to create validation data that allowed us to choose $K$ for KNN. Now that we have our choice of $K$, let’s use it to make predictions on our test data.

Question 22

Using the $K$ number of neighbors you chose in Question 20, train KNN using the train data. Then, make predictions on the test data and compute the test RMSE. State and interpret the test RMSE as the answer to this question.

Hint: No loops in this question!!

Question 23

Are the validation RMSE in Question 20 and the test RMSE in Question 22 the same? Do we expect them to be? Explain why or why not.

Before you submit

A few last steps before we knit, and then you will be done with Lab 4!

1. Find the top of this file (the little tab), and look under it. You should see something with ABC and a check mark. This is for checking spelling! Click this to check your spelling before you do a final knit and submit.
2. If you are working with a partner, make sure their name and yours are at the top of the file.
3. Look at the very first chunk in your Markdown file. You should see something like:

knitr::opts_chunk$set(echo = TRUE)

Change this to:

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

4. You must submit a PDF or HTML file. If you submit any other file type, it cannot be graded. Let me know if you have any questions.

Once you’ve done this, knit your file. This will create the PDF or HTML you need to submit. If you get stuck, let Dr. Dalzell know!

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2026 January 26.

The data has been adapted from the source below. Please note that the data sets have been cleaned and a new column, price, has been simulated and added to the original data set.

Citation: Abhishek, Sharma. (2017). Mobile Price Classification, Version 1. Retrieved December 20, 2025 from https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification