Today, we are going to apply KNN to our penguin data from Lab 1. Our goal is to practice the code for KNN, as well as to practice a few key skills in R and to assess prediction accuracy.
To load the data, put the following two lines of code inside a chunk and press play:
library(palmerpenguins)
data("penguins")
As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data. KNN (and indeed most models) cannot be applied when there is any missing data, so we will begin by removing the rows with missing data.
penguins <- na.omit(penguins)
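If you want to confirm how many penguins remain after removing the rows with missing data, a quick (optional) check is nrow(), which counts the rows of a data frame:
nrow(penguins) # should return 333 once the rows with missing data are removed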
Our goal for today is to predict the sex of a penguin based on features of that penguin. The sex of a bird is often very difficult to determine from appearance alone, especially when the bird is young, so a predictive model like this can be very useful to ecologists.
Now, as we discussed in class, we typically are presented with two different data sets when our goal is prediction. The first, the training data set, is used to conduct EDA and to train our model. The second, the test data set, is used to test the accuracy of our model.
Here, we are only provided with one data set. To allow us to practice, we are going to manually divide the data into test and training data sets. Note: This is NOT something we do in real life, and we will discuss why soon! However, we will be using the idea of splitting a given data set into several smaller data sets very often in this course, so learning how to do this now will serve us well when we get to cross-validation in a few classes.
The first step in the process of splitting a data set into test and training data is to determine which rows will go in the training data, and which will go in the test data. We will use random sampling to help us to do this. Put the following code in a chunk and press play.
set.seed(100)
trainRows <- sample(1:333, 200)
The code chunk above has two lines. The first line of code, set.seed(100), is one we will use quite a bit in this class. We are going to use random sampling to choose the rows for our training data. This means that we are going to ask the computer to randomly sample 200 of our 333 rows. However, suppose you close your Markdown file and come back to it later. We want the computer to choose the SAME 200 rows when you run your code again. Otherwise, you would get completely different results each time you drew a random sample! The set.seed() function is what ensures that each time you run your chunk, you will get the same random sample. Let's try that.
In a new chunk, use the code
sample(1:10, 2)
to print out 2 random numbers between 1 and 10. Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now? You should notice that every time you hit play on this chunk, you get a different sample. This is what would happen if you closed your Markdown and re-opened it, or if you gave your code to someone else to run. This is not something we want to have happen, so we set a random seed to fix this problem.
Now, add the line
set.seed(435)
to the beginning of your code chunk from the previous question (meaning this line needs to come before the sample command). You will note that I used 435, but you can use literally any positive integer you want as your random seed. Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now? Now we notice that no matter how often we run the chunk, we get the same values. Yes!! This means that we can close our R and come back to it later, and our results will not change. This also means that we can send our code to someone else, and they will get the same random sample that we did. This means that setting a seed can help make your code reproducible.
Setting a random seed (which is what we will call using the set.seed() command) will prove very useful for any kind of random sampling we do in this course.
Now, let's go back to the code chunk we started with:
set.seed(100)
trainRows <- sample(1:333, 200)
We have now explored the first line, but what does the second line do? Well, sample(1:333, 200) will draw 200 random numbers between 1 and 333 (1:333). By default, sample() draws without replacement, so no row number will appear twice.
Okay, back to our code chunk. Once we have drawn our 200 random numbers, we store that output in a vector called trainRows. The <- operator is what stores the result. Where does it store it? Well, look in the upper right corner of your RStudio screen. You will see a vector called trainRows.
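If you would rather check from a chunk than from the upper right panel, a quick optional sketch:
length(trainRows) # how many row numbers we drew; should be 200
head(trainRows) # the first few randomly chosen row numbers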
Adapt the code
trainRows <- sample(1:333, 200)
so that you select only 5 random numbers between 1 and 333, and store the result in a vector called trainPractice. Which 5 random numbers did you select? Okay, so right now trainRows is a vector that contains 200 numbers between 1 and 333. These are the rows we want to then grab from the penguins data and use as our training data. To do this, we use
trainPenguins <- penguins[trainRows,]
Again, you will notice the use of the storage operator <-, but this time we are not storing a vector. We are instead storing a data frame (a data set). We have reached into the penguins data and pulled all the rows that we selected in trainRows. We used these rows to create a new (smaller) data set called trainPenguins, which should have 200 rows. Note that [trainRows, ] indicates that we want the 200 rows in trainRows, and we want all the columns. If we wanted only column 1, for instance, we would use [trainRows, 1].
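If square-bracket indexing is new to you, here are two small standalone examples (they are just for illustration and are not part of the lab):
penguins[1:3, ] # the first three rows of the penguins data, all columns
penguins[1:3, 1] # the first three rows, but only column 1 (species)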
Using the trainPractice vector you created in Question 4, create a data frame (data set) called trainPracticeData by selecting only the rows from the penguin data that are indicated in trainPractice. Print out the result by typing trainPracticeData in a chunk and pressing play. Okay, now we have our training data. This is 200 of the original 333 rows. What do we do with the 133 rows that were not chosen for training data? The rows that were not selected for our training data will be in our test data set. To choose all the rows that were not in trainRows, we use:
, we use:
testPenguins <- penguins[-trainRows,]
The - operator means "not" or "except"; it grabs all the rows from the penguins data set EXCEPT those we already put in the training data.
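If negative indexing is new to you, here is a tiny standalone example (the vector x is made up purely for illustration):
x <- c(10, 20, 30, 40)
x[-2] # everything in x EXCEPT the second element: 10 30 40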
Check to make sure trainPenguins has 200 rows and testPenguins has 133 rows.
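One way to check (if you prefer code to looking at the Environment pane) is nrow(), which counts the rows of a data frame:
nrow(trainPenguins) # should be 200
nrow(testPenguins) # should be 133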
Now that we have our test data and training data, we have created a few objects we no longer need. To clean up our workspace in R, we can remove these objects if we wish. To do so, we use the rm() command:
rm(trainRows,trainPractice,trainPracticeData)
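If you want to confirm the clean-up worked, ls() lists every object still in your workspace (this check is optional):
ls() # the removed objects should no longer appear in this list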
We have information on 8 variables in the training data:
species - the species of the penguin.
island - the island where the penguin lives.
body_mass_g - the mass of the penguin in grams.
bill_length_mm - the length of the penguin bill in millimeters.
bill_depth_mm - the depth of the penguin bill in millimeters.
flipper_length_mm - the flipper length of the penguin in millimeters.
sex - the biological sex of the penguin.
year - the year the penguin was measured.
We are going to focus on only 3 to start:
body_mass_g - the mass of the penguin in grams.
bill_length_mm - the length of the penguin bill in millimeters.
sex - the biological sex of the penguin.
Our response variable will be sex, and our task is to try to use body mass and bill length to predict the sex of the penguin.
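Before modeling, it can be worth a quick (and optional) look at how many penguins of each sex are in the training data; table() counts how many times each level of a factor appears:
table(trainPenguins$sex) # counts of female and male penguins in the training data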
Now, let's use KNN to predict the sex of the penguin. The functions we need for KNN are in the class library in R. Go ahead and load the library you need.
suppressMessages(library(class))
The actual function we will use for KNN is called (shockingly!!) knn(). It takes a few arguments (inputs we need to make the function run).
knn(train = , test = , cl = , k = )
train = : Here, you will put the features from the training data. We are using the 3rd and 6th columns of the training data (bill length and body mass) for our features, so we use train = trainPenguins[,c(3,6)].
test = : Here, you will put the same two columns of the test data.
cl = : Here, you give the response variable from the training data. Example: cl = trainPenguins$sex
k = : Here, you provide the integer value you would like to use for k. How many nearest neighbors should we use?
Run knn() to predict the sex of each penguin in the test data, and store the results in a vector called predictions (a sketch of the full call appears below). Hint: Look back at previous questions if you need to remember how to do this! If you don't store the results, you will get 133 values printing out on your screen, which is not what we want!
Once we get predictions, we need to check to see how accurate they are. We are making predictions on our test data, which has Y values in it, so we know what the values of Y are supposed to be. This means we can check to see if our predictions are correct!
One common way of comparing predictions of your response variable to the actual values of the response variable is to use a confusion matrix. To make a confusion matrix, you will make a table with the rows containing the predictions and the columns containing the true values of Y. To make your confusion matrix, you can use:
knitr::kable(table("Predictions" = predictions, "Actual" = testPenguins$sex), col.names = c("Female (Actual)", "Male (Actual)"), caption = "Table 1: Predictions (Rows) vs. Truth (Columns)")
How do we read this? Well, the first column contains the female penguins in the test data, and the first row contains the penguins we predicted as being female. So, the cell in the first row and first column counts the penguins we correctly predicted as female. Similarly, the cell in the second row and second column counts the penguins we correctly predicted as male.
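If you would also like a single summary number, one option (not required by the confusion matrix itself) is the overall accuracy, the proportion of test penguins whose sex we predicted correctly:
mean(predictions == testPenguins$sex) # proportion of correct predictions on the test data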
One last step before we knit. Look at the very first chunk in your Markdown file. You should see something like knitr::opts_chunk$set(echo = TRUE). For this lab, I need to see your code, so make sure you see this. Typically, we change this to knitr::opts_chunk$set(echo = FALSE) to hide all the code you have created from your final document, but I need to see your code for today.
We have explored some tools in R (like random seeds and sampling) and we have explored the KNN code. We will explore more about this classification technique soon!