Complete all Questions and submit your final PDF under Assignments in Canvas.

The Goal

Today, we are going to apply KNN to our penguin data, just like we did in Lab 2. However, now we are going to use KNN for a different purpose. In class so far, we have only seen how KNN can be used for classification tasks. Today, we are going to see how we can use the method when our task is regression.

The Data

To load the data, put the following two lines of code inside a chunk and press play:

library(palmerpenguins)
data("penguins")

As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data. KNN (and indeed most models) cannot be applied when there is any missing data, so we will begin by removing the rows with missing data.

penguins <- na.omit(penguins)
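
If you would like a quick check that this worked, you can look at the dimensions of the data (an optional sanity check; you should be left with 333 penguins and the same 8 columns):

# How many penguins remain after removing the rows with missing data?
dim(penguins)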

Our goal for today is to predict the body mass of the penguin based on the flipper length and bill length of that penguin. This means that our response variable Y is a numeric variable, so this is a regression task, and not a classification task.

We are going to start this lab using the entire penguins data set to learn the KNN approach when the goal is regression. Then, we will explore the idea of using the original training data to create training and test data to help us choose K.

Regression with KNN

The KNN approach we have seen so far takes several numeric features and uses them to predict Y. However, the Ys we have worked with so far are categorical, so we ended up making our predictions by choosing the most popular value of Y in the K nearest points to any given row in the data set.

When the goal is regression, we can't use the same idea of popularity; there are too many possible values of Y. Indeed, quite frequently each of the K nearest neighbors to a point has a different value of Y. So what do we do?

1. Create a scatter plot with bill length on the X axis and flipper length on the Y axis. These are the two numeric features we will be using. Label your axes and add an appropriate title.

Now, we want to highlight a particular point on this graph: the first penguin. To do this, we add a line of code at the end of our plot code: + geom_point(data = penguins[1,], color = "red", pch = 22, lwd = 3). This adds a red square around the point for the first penguin in the data set.
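
As a rough sketch of how the pieces fit together (assuming you built your plot in Question 1 with ggplot2 using the column names bill_length_mm and flipper_length_mm; adjust to match your own code), this would look something like:

library(ggplot2)

# Scatter plot of the two features, with the first penguin highlighted in red
ggplot(penguins, aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point() +
  labs(title = "Flipper Length versus Bill Length of Palmer Penguins",
       x = "Bill Length (mm)", y = "Flipper Length (mm)") +
  geom_point(data = penguins[1,], color = "red", pch = 22, lwd = 3)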

2. Using the code above, adapt your plot from Question 1.

Now that we can see the first penguin, let's think about what prediction KNN would make for that point. For now, let K = 5. KNN looks at the 5 nearest neighbors (in terms of Euclidean distance) and then finds the Y values associated with those points. The predicted value of Y for our first penguin is the average of the 5 Y values of those nearest neighbors.
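
For example (using made-up body masses, not the ones in the next question), the prediction is just the sample mean of the neighbors' Y values:

# KNN regression prediction = the average of the neighbors' body masses (grams)
neighborMasses <- c(3700, 3800, 3650, 3900, 3750)
mean(neighborMasses)   # 3760 grams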

3. Suppose we have a penguin whose 5 nearest neighbors have body masses of 4000, 4250, 4150, 4350, and 3900 grams. What would KNN (with K = 5) predict for the body mass of this penguin?

Now, we are not going to manually identify the nearest neighbors of each penguin and average their body masses to get a prediction. Instead, we will use R to help us do this. Using KNN for regression requires slightly different code than KNN for classification. Instead of the function knn(), we need the function knnreg(), which can be found in the caret package in R.

library(caret)

To actually use the code, we need to supply the features (penguins[,c(3,5)]) and the response (penguins$body_mass_g), as well as our choice of K.

knnOut <- knnreg(penguins[,c(3,5)], penguins$body_mass_g, k = 5)

This trains the method, but does not create predictions. To create predictions, we need one more line of code:

knnPred <- predict(knnOut, newdata = penguins[,c(3,5)])
4. What type of object is knnPred: (A) a matrix, (B) a data frame, (C) a vector, or (D) a scalar?

The predict() function in R is used with a variety of models, including LSLR, to make predictions. The first input is the trained model (knnOut) and the second input is the data we want to make predictions on (in this case the training data penguins[,c(3,5)]).

5. What is the residual for the first penguin in the data set? Hint: as a reminder, a residual is the true value of Y minus the prediction.

Hey, we can use residuals again, just like we did in LSLR! This means that one metric we can use to evaluate how well the method is doing at making predictions is the mean squared error, or MSE. The MSE is just the residual sum of squares (RSS) divided by the number of data points we are making predictions on. With n rows in the training data, we have:

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
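
To see the formula in action on a tiny made-up example (so you can still write your own code for the penguins in the next question), suppose we had three true Y values and three predictions:

# Made-up true Y values and predictions
yToy    <- c(4000, 3500, 4200)
yhatToy <- c(3900, 3600, 4300)

# Residuals, squared residuals, and their average (the MSE)
residToy <- yToy - yhatToy   # 100, -100, -100
mean(residToy^2)             # (100^2 + 100^2 + 100^2) / 3 = 10000 grams squared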

6. Using R, find the MSE for KNN with the training data (called the training MSE). Show your code. Hint: we have a column containing Y in our data and a vector of predictions stored in knnPred.

The MSE is the average of the squared residuals, which means the units of the MSE are the units of the response variable, but squared. In this case, this means the MSE is on the scale of grams squared. To put the metric back on the same scale as Y (grams), we sometimes use the root mean squared error, or RMSE:

\[ RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 } \]
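
Continuing the made-up example above, the RMSE is just the square root of the MSE:

# RMSE on the made-up example: back on the original scale of Y (grams)
sqrt(mean(residToy^2))   # sqrt(10000) = 100 grams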

7. Using R, find the RMSE for KNN on the training data. We call this the training RMSE.
8. Use the summary command in R to look at the body mass of the penguins in the data. Use the RMSE to comment on how well KNN seems to be predicting body mass.
9. Now, use KNN to predict body mass using the 4 features flipper length, bill length, bill depth, and year. State the training RMSE, and comment on whether adding these features seems to have improved prediction on the training data.

Okay, so now we can use KNN for regression problems as well as classification. However, we still have the problems of (1) choosing K and (2) assessing how well the model will do at predicting for test data. Let's address both of these.

Creating Test Data

Last class, we discussed one approach to creating test data, which involves dividing the original training data set into two new data sets: new training and new test.
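
If you need a reminder of the mechanics, one common pattern looks like the sketch below. This is only a sketch: the object names trainRows, newTrain, and newTest are just suggestions, and you will need the seed and percentages the question asks for.

# Randomly choose row numbers for the new training data (80% of the rows)
set.seed(123)   # replace 123 with the seed the question specifies
trainRows <- sample(1:nrow(penguins), size = floor(0.8 * nrow(penguins)))

# Split into new training and new test data
newTrain <- penguins[trainRows, ]
newTest  <- penguins[-trainRows, ]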

10. Divide the training data into these two data sets (new training and new test) using a seed of 100. The new training data set should have 80% of the original training data, and the other 20% should be in the new test data. Show your code as your answer to this question.
11. We are going to work with this split for today, but what are two concerns with using this approach to create test data?
Choosing K

Now that we have our test and training data, it is time to use them to determine K. We want to try a variety of choices of K and determine which choice gives us the best prediction on the new test data.
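
If you get stuck on the loop, here is a rough skeleton of the kind of code you might write. This is only a sketch, assuming the newTrain and newTest names from the splitting sketch above and the caret package loaded earlier; the annotations and final details are up to you.

# Storage for the test RMSE for each choice of K
maxK <- 25
testRMSE <- numeric(maxK)

for (k in 1:maxK) {
  # Train KNN regression on the new training data with the current K
  knnOutK <- knnreg(newTrain[, c(3, 5)], newTrain$body_mass_g, k = k)

  # Predict on the new test data
  predK <- predict(knnOutK, newdata = newTest[, c(3, 5)])

  # Compute and store the test RMSE for this K
  testRMSE[k] <- sqrt(mean((newTest$body_mass_g - predK)^2))
}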

12. We are going to use KNN to predict body mass using the two features bill length and flipper length, and are considering K = 1, 2, ..., 25. Write and run a for loop that obtains the test RMSE for each of these choices of K. Show your annotated code as the answer to this question. Let me know if you get stuck!
13. Create a graph showing your test RMSE results. Make sure you have labeled your axes, included a caption, and added a red dashed line indicating the K with the best RMSE. Look at the slides from last class if you get stuck!
14. Using the graph and the output of the loop, which choice of K would you choose, and why?
Combining the Two

Now, we have seen that for our choice of random split into new training and new test data, we end up with the choice of K you stated in the previous question. However, as we have seen in class, this choice of K can change depending on which rows of data end up in the new training data set versus the new test data set. Let's explore that for ourselves.

15. Using a new random seed, split the original training data into new training and new test. Then, run the for loop from Question 12 again and create a graph of your results as you did in Question 13. Show your plot, and state what choice of K you would recommend now.

This method of manually running through the process with a different seed is fine, but it is tedious if we want to explore many different choices of seed. One alternative is to create a function that will help us. For now, this function needs to take just one input, the random seed, since we are not changing anything else. The inside of the function will need to do basically everything you did in the previous question, and the goal is to produce the graph of the RMSE. Try it out!
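
Here is a skeleton you might start from. This is only a sketch: it reuses the splitting and loop code sketched above, and the base R plot() call at the end is a placeholder for the graph you built in Question 13, so swap in your own graphing code and labels.

simulationSeed <- function(randomseed) {
  # Step 1: split the original training data using the supplied seed
  set.seed(randomseed)
  trainRows <- sample(1:nrow(penguins), size = floor(0.8 * nrow(penguins)))
  newTrain <- penguins[trainRows, ]
  newTest  <- penguins[-trainRows, ]

  # Step 2: loop over K = 1, ..., 25 and store the test RMSE for each K
  testRMSE <- numeric(25)
  for (k in 1:25) {
    knnOutK <- knnreg(newTrain[, c(3, 5)], newTrain$body_mass_g, k = k)
    predK <- predict(knnOutK, newdata = newTest[, c(3, 5)])
    testRMSE[k] <- sqrt(mean((newTest$body_mass_g - predK)^2))
  }

  # Step 3: graph the test RMSE against K (replace with your graph from Question 13)
  plot(1:25, testRMSE, type = "b", xlab = "K", ylab = "Test RMSE")
}

# Run it once with any seed to check that it works
simulationSeed(100)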

16. Create a function called simulationSeed() that takes the input randomseed and produces the desired graph. Show your code, and run the function once to make sure it works!

In our next class, we will talk about two ways to estimate predictive accuracy without relying on a single split into test and training data. These two methods will rely on the coding we did today, so make sure to ask if you have any questions.

Before you submit

One last step before we knit. Look at the very first chunk in your Markdown file. You should see something like knitr::opts_chunk$set(echo = TRUE). Typically, we change this to knitr::opts_chunk$set(echo = FALSE) to hide all of your code in the final document, but for this lab I need to see your code, so make sure echo is set to TRUE.

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF document to Canvas; no other formats will be accepted. Look through the final PDF to make sure everything has knit correctly.
Creative Commons License

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 January 25.

The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.

The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.