Today, we are going to apply KNN to our penguin data, just like we did in Lab 2. However, now we are going to use KNN for a different purpose. In class so far, we have only seen how KNN can be used for classification tasks. Today, we are going to see how we can use the method when our task is regression.
To load the data, put the following two lines of code inside a chunk and press play:
library(palmerpenguins)
data("penguins")
As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data. KNN (and indeed most models) cannot be applied when there is any missing data, so we will begin by removing the rows with missing data.
penguins <- na.omit(penguins)
Our goal for today is to predict the body mass of the penguin based on the flipper length and bill length of that penguin. This means that our response variable Y is a numeric variable, so this is a regression task, and not a classification task.
We are going to start this lab using the entire penguins data set to learn the KNN approach when the goal is regression. Then, we will explore the idea of using the original training data to create training and test data to help us choose K.
The KNN approach we have seen so far takes several numeric features and uses them to predict Y. However, the Ys we have worked with so far are categorical, so we ended up making our predictions by choosing the most popular value of Y in the K nearest points to any given row in the data set.
When the goal is regression, we can't use the same idea of popularity. There are too many possible unique values of Y. Indeed, quite frequently each of the K nearest neighbors to a point has a different value of Y. So what do we do?
Now, we want to highlight a particular point on this graph - the first penguin. To do this, we add a line of code at the end of our plot code: + geom_point(data = penguins[1,], color = "red", pch = 22, lwd = 3). This adds a red square around the point for the first penguin in the data set.
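For reference, one possible version of the full plot command is sketched below. This sketch assumes the graph is a ggplot2 scatterplot with flipper length on the x-axis and bill length on the y-axis, which may differ from the exact plot you built earlier in the lab.

library(ggplot2)

# Scatterplot of the two features, with the first penguin marked by a red square
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point() +
  geom_point(data = penguins[1,], color = "red", pch = 22, lwd = 3)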
Now that we can see the first penguin, let's think about what prediction KNN would make for that point. For now, let K = 5. KNN looks at the 5 nearest neighbors (in terms of Euclidean distance) and then finds the Y values associated with those points. The predicted value of Y for our first penguin will be the average of the 5 Y values of the nearest neighbors to that first penguin.
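To build some intuition, here is a minimal sketch of that calculation done by hand for the first penguin. The use of bill length and flipper length as the two features, and the decision not to count the penguin as its own neighbor, are assumptions made for illustration only.

# Euclidean distance from every penguin to the first penguin, using the two features
d <- sqrt((penguins$bill_length_mm - penguins$bill_length_mm[1])^2 +
          (penguins$flipper_length_mm - penguins$flipper_length_mm[1])^2)

# The 5 nearest neighbors (positions 2-6, since position 1 is the first penguin itself)
nearest5 <- order(d)[2:6]

# The KNN prediction: the average body mass of those 5 neighbors
mean(penguins$body_mass_g[nearest5])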
Now, we are not going to manually identify the nearest neighbors of each penguin and average their body masses to get a prediction. Instead, we will use R to help us do this. Using KNN for regression requires slightly different coding than KNN for classification. Instead of the function knn(), we need the function knnreg(), which can be found in the caret package in R.
library(caret)
To actually use the code, we need to supply the features (penguins[,c(3,5)]) and the response (penguins$body_mass_g), as well as our choice of K.
knnOut <- knnreg(penguins[,c(3,5)], penguins$body_mass_g, k = 5)
This trains the method, but does not create predictions. To create predictions, we need one more line of code:
knnPred <- predict(knnOut, newdata = penguins[,c(3,5)])
Take a look at knnPred. Which of the following best describes it: (A) a matrix, (B) a data frame, (C) a vector, or (D) a scalar?

The predict code in R is used with a variety of models, including LSLR, to make predictions. The first input is the trained model (knnOut) and the second input is the data we want to make predictions on (in this case the training data penguins[,c(3,5)]).
Hey, we can use residuals again, just like we did in LSLR! This means that one metric we can use to evaluate how well the method is doing at making predictions is the mean squared error, or MSE. The MSE is just the residual sum of squares (RSS) divided by the number of data points we are making predictions on. With n rows in the training data, we have:
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i -\hat{y}_i)^2 \]
Compute the MSE of the training data predictions stored in knnPred.

The MSE is the average of the squared residuals, which means the units of the MSE are the units of the response variable, but squared. In this case, this means that the MSE is on the scale of grams squared. To put the metric back on the same scale as Y (grams), we sometimes use the root mean squared error, or RMSE:
\[ RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i -\hat{y}_i)^2 }\]
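If you want to check your work, here is a minimal sketch of how both metrics could be computed in R, assuming the knnPred vector created above:

# Residuals: observed body mass minus predicted body mass
resids <- penguins$body_mass_g - knnPred

# Mean squared error (grams squared) and root mean squared error (grams)
mse <- mean(resids^2)
rmse <- sqrt(mse)
mse
rmse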
Okay, so now we can use KNN with regression problems as well as classification. However, we still have the problem of (1) choosing K and (2) assessing how well the model will do at predicting for test data. Let's address both of these.
Last class, we discussed one approach to creating test data, which involves dividing the original training data set into two new data sets - new training and new test.
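As a reminder, one way to create that split is sketched below. The 80/20 proportion, the object names penguinsTrain and penguinsTest, and the seed value are arbitrary choices for illustration, not part of the lab instructions.

# Randomly choose roughly 80% of the rows for the new training data
set.seed(100)
trainRows <- sample(1:nrow(penguins), size = floor(0.8 * nrow(penguins)))

# Everything else becomes the new test data
penguinsTrain <- penguins[trainRows, ]
penguinsTest  <- penguins[-trainRows, ]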
Now that we have our test and training data, it is time to use them to determine K. We want to try a variety of choices of K and determine which choice gives us the best prediction on the new test data.
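One possible way to organize that search is sketched below, assuming the penguinsTrain and penguinsTest objects from the split above and a candidate grid of K from 1 to 25; both of these are assumptions rather than requirements.

# Candidate values of K
kGrid <- 1:25
rmseTest <- rep(NA, length(kGrid))

for (i in 1:length(kGrid)) {
  # Train KNN on the new training data with the current choice of K
  fit <- knnreg(penguinsTrain[, c(3, 5)], penguinsTrain$body_mass_g, k = kGrid[i])

  # Predict on the new test data and compute the test RMSE
  preds <- predict(fit, newdata = penguinsTest[, c(3, 5)])
  rmseTest[i] <- sqrt(mean((penguinsTest$body_mass_g - preds)^2))
}

# Plot test RMSE against K to see which choice predicts best
plot(kGrid, rmseTest, type = "b", xlab = "K", ylab = "Test RMSE")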
Now, we have seen that for our choice of random split into new training and new test data, we end up with the choice of K you stated in the previous question. However, as we have seen in class, this choice of K can change depending on which rows of data end up in the new training data set versus the new test data set. Let's see if we can't explore that for ourselves.
This method of manually running through the process with a different seed is fine, but it is tedious if we wanted to explore many different choices of seed. One alternative is to create a function that will help us. For now, this function needs to take just one input, the random seed, since we are not changing anything else. The inside of the function will need to do basically everything you did in the previous question, and the goal is to produce the graph of the RMSE. Try it out!
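If you get stuck, here is one possible skeleton for such a function, reusing the split-and-search code from above. The 80/20 split, the grid of K values, and the style of graph are all assumptions you may well have made differently.

simulationSeed <- function(randomseed) {
  # Split the data using the supplied seed
  set.seed(randomseed)
  trainRows <- sample(1:nrow(penguins), size = floor(0.8 * nrow(penguins)))
  penguinsTrain <- penguins[trainRows, ]
  penguinsTest  <- penguins[-trainRows, ]

  # Compute the test RMSE for each candidate K
  kGrid <- 1:25
  rmseTest <- rep(NA, length(kGrid))
  for (i in 1:length(kGrid)) {
    fit <- knnreg(penguinsTrain[, c(3, 5)], penguinsTrain$body_mass_g, k = kGrid[i])
    preds <- predict(fit, newdata = penguinsTest[, c(3, 5)])
    rmseTest[i] <- sqrt(mean((penguinsTest$body_mass_g - preds)^2))
  }

  # Produce the graph of test RMSE versus K
  plot(kGrid, rmseTest, type = "b", xlab = "K", ylab = "Test RMSE")
}

# Run it once to check that it works (the seed value here is arbitrary)
simulationSeed(365)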
Create a function called simulationSeed() that takes the input randomseed and produces the desired graph. Show your code, and run the function once to make sure it works!

In our next class, we will talk about two ways to estimate the predictive accuracy without relying on a single split into test and training data. These two methods will rely on the coding we did today, so make sure to ask if you have any questions.
One last step before we knit. Look at the very first chunk in your Markdown file. You should see something like knitr::opts_chunk$set(echo = TRUE). For this lab, I need to see your code, so make sure you see this. Typically, we change this to knitr::opts_chunk$set(echo = FALSE) to hide all the code you have created from your final document, but I need to see your code for today.