Today, we are going to apply KNN to our penguin data from Lab 1. Our goal is to practice the code for KNN, as well as to practice a few key skills in R and to assess prediction accuracy.
To load the data, put the following two lines of code inside a chunk and press play:
library(palmerpenguins)
data("penguins")
As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data. KNN (and indeed most models) cannot be applied when there is any missing data, so we will begin by removing the rows with missing data.
penguins <- na.omit(penguins)
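If you want to confirm how many penguins remain after removing the rows with missing data, a quick (optional) check is nrow(), which counts the rows of a data frame:
nrow(penguins) # should return 333 once the rows with missing data are removed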
Our goal for today is to predict the sex of a penguin based on features of that penguin. The sex of a bird is often very difficult to determine from appearance alone, especially when the bird is young, so a predictive model like this can be very useful to ecologists.
Now, as we discussed in class, we typically are presented with two different data sets when our goal is prediction. The first, the training data set, is used to conduct EDA and to train our model. The second, the test data set, is used to test the accuracy of our model.
Here, we are only provided with one data set. To allow us to practice, we are going to manually divide the data into test and training data sets. Note: This is NOT something we do in real life, and we will discuss why soon! However, we will be using the idea of splitting a given data set into several smaller data sets very often in this course, so learning how to do this now will serve us well when we get to cross-validation in a few classes.
The first step in the process of splitting a data set into test and training data is to determine which rows will go in the training data, and which will go in the test data. We will use random sampling to help us to do this. Put the following code in a chunk and press play.
set.seed(100)
trainRows <- sample(1:333, 200)
The code chunk above has two lines. The first line of code, set.seed(100), is one we will use quite a bit in this class. We are going to use random sampling to choose the rows for our training data. This means that we are going to ask the computer to randomly sample 200 of our 333 rows. However, suppose you close your Markdown file and come back to it later. We want the computer to choose the SAME 200 rows when you run your code again. Otherwise, you would get completely different results each time you drew a random sample! The set.seed() function is what ensures that each time you run your chunk, you will get the same random sample. Let's try that.
In a new chunk, use the code
sample(1:10, 2)
to print out 2 random numbers between 1 and 10. Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now? You should notice that every time you hit play on this chunk, you get a different sample. This is what would happen if you closed your Markdown and re-opened it, or if you gave your code to someone else to run. This is not something we want to have happen, so we set a random seed to fix this problem.
Now, add the line
set.seed(435)
to the beginning of your code chunk from the previous question (meaning this line needs to come before the sample command). You will note that I used 435, but you can use literally any positive integer you want as your random seed. Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now? Now we notice that no matter how often we run the chunk, we get the same values. Yes!! This means that we can close our R and come back to it later, and our results will not change. This also means that we can send our code to someone else, and they will get the same random sample that we did. This means that setting a seed can help make your code reproducible.
Setting a random seed (which is what we will call using the set.seed() command) will prove very useful for any kind of random sampling we do in this course.
Now, let's go back to the code chunk we started with:
set.seed(100)
trainRows <- sample(1:333, 200)
We have now explored the first line, but what does the second line do? Well, sample(1:333, 200) will draw 200 random numbers between 1 and 333 (1:333). By default, sample() draws without replacement, so no row number will appear twice.
Okay, back to our code chunk. Once we have drawn our 200 random numbers, we store that output in a vector called trainRows. The <- operator is what stores the result. Where does it store it? Well, look in the upper right corner of your RStudio screen. You will see a vector called trainRows.
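If you would rather check from a chunk than from the upper right panel, a quick optional sketch:
length(trainRows) # how many row numbers we drew; should be 200
head(trainRows) # the first few randomly chosen row numbers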
Adapt the code
trainRows <- sample(1:333, 200)
so that you select only 5 random numbers between 1 and 333, and store the result in a vector called trainPractice. Which 5 random numbers did you select? Okay, so right now trainRows is a vector that contains 200 numbers between 1 and 333. These are the rows we want to then grab from the penguins data and use as our training data. To do this, we use
trainPenguins <- penguins[trainRows,]
Again, you will notice the use of the storage operator <-, but this time we are not storing a vector. We are instead storing a data frame (a data set). We have reached into the penguins data and pulled all the rows that we selected in trainRows. We used these rows to create a new (smaller) data set called trainPenguins, which should have 200 rows. Note that [trainRows, ] indicates that we want the 200 rows in trainRows, and we want all the columns. If we wanted only column 1, for instance, we would use [trainRows, 1].
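If square-bracket indexing is new to you, here are two small standalone examples (they are just for illustration and are not part of the lab):
penguins[1:3, ] # the first three rows of the penguins data, all columns
penguins[1:3, 1] # the first three rows, but only column 1 (species)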
Using the trainPractice vector you created in Question 4, create a data frame (data set) called trainPracticeData by selecting only the rows from the penguin data that are indicated in trainPractice. Print out the result by typing trainPracticeData in a chunk and pressing play. Okay, now we have our training data. This is 200 of the original 333 rows. What do we do with the 133 rows that were not chosen for training data? The rows that were not selected for our training data will be in our test data set. To choose all the rows that were not in trainRows, we use:
, we use:
testPenguins <- penguins[-trainRows,]
The - operator means "not" or "except"; it grabs all the rows from the penguins data set EXCEPT those we already put in the training data.
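If negative indexing is new to you, here is a tiny standalone example (the vector x is made up purely for illustration):
x <- c(10, 20, 30, 40)
x[-2] # everything in x EXCEPT the second element: 10 30 40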
Check to make sure trainPenguins has 200 rows and testPenguins has 133 rows.
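One way to check (if you prefer code to looking at the Environment pane) is nrow(), which counts the rows of a data frame:
nrow(trainPenguins) # should be 200
nrow(testPenguins) # should be 133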
Now that we have our test data and training data, we have created a few objects we no longer need. To clean up our workspace in R, we can remove these objects if we wish. To do so, we use the rm() command:
rm(trainRows,trainPractice,trainPracticeData)
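If you want to confirm the clean-up worked, ls() lists every object still in your workspace (this check is optional):
ls() # the removed objects should no longer appear in this list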
We have information on 8 variables in the training data:
species - the species of the penguin.
island - the island where the penguin lives.
body_mass_g - the mass of the penguin in grams.
bill_length_mm - the length of the penguin bill in millimeters.
bill_depth_mm - the depth of the penguin bill in millimeters.
flipper_length_mm - the flipper length of the penguin in millimeters.
sex - the biological sex of the penguin.
year - the year the penguin was measured.
We are going to focus on only 3 to start:
body_mass_g - the mass of the penguin in grams.
bill_length_mm - the length of the penguin bill in millimeters.
sex - the biological sex of the penguin.
Our response variable will be sex, and our task is to try to use body mass and bill length to predict the sex of the penguin.
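Before modeling, it can be worth a quick (and optional) look at how many penguins of each sex are in the training data; table() counts how many times each level of a factor appears:
table(trainPenguins$sex) # counts of female and male penguins in the training data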
Now, let's use KNN to predict the sex of the penguin. The functions we need for KNN are in the class library in R. Go ahead and load the library you need.
suppressMessages(library(class))
The actual function we will use for KNN is called (shockingly!!) knn(). It takes a few arguments (inputs we need to make the function run).
knn(train = , test = , cl = , k = )
train = : Here, you will put the features from the training data. We are using the 3rd and 6th columns of the training data (bill length and body mass) for our features, so we use train = trainPenguins[,c(3,6)].
test = : Here, you will put the same two columns of the test data.
cl = : Here, you give the response variable from the training data. Example: cl = trainPenguins$sex
k = : Here, you provide the integer value you would like to use for k. How many nearest neighbors should we use?
Run knn() to predict the sex of each penguin in the test data, and store the results in a vector called predictions (a sketch of the full call appears below). Hint: Look back at previous questions if you need to remember how to do this! If you don't store the results, you will get 133 values printing out on your screen, which is not what we want!
Once we get predictions, we need to check to see how accurate they are. We are making predictions on our test data, which has Y values in it, so we know what the values of Y are supposed to be. This means we can check to see if our predictions are correct!
One common way of comparing predictions of your response variable to the actual values of the response variable is to use a confusion matrix. To make a confusion matrix, you will make a table with the rows containing the predictions and the columns containing the true values of Y. To make your confusion matrix, you can use:
knitr::kable(table("Predictions" = predictions, "Actual" = testPenguins$sex), col.names = c("Female (Actual)", "Male (Actual)"), caption = "Table 1: Predictions (Rows) vs. Truth (Columns)")
How do we read this? Well, the first column contains the female penguins in the test data, and the first row contains the penguins we predicted as being female. So, the cell in the first row and first column counts the penguins we correctly predicted as female. Similarly, the cell in the second row and second column counts the penguins we correctly predicted as male.
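If you would also like a single summary number, one option (not required by the confusion matrix itself) is the overall accuracy, the proportion of test penguins whose sex we predicted correctly:
mean(predictions == testPenguins$sex) # proportion of correct predictions on the test data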
One last step before we knit. Look at the very first chunk in your Markdown file. You should see something like knitr::opts_chunk$set(echo = TRUE). For this lab, I need to see your code, so make sure you see this. Typically, we change this to knitr::opts_chunk$set(echo = FALSE) to hide all the code you have created from your final document, but I need to see your code for today.
We have explored some tools in R (like random seeds and sampling) and we have explored the KNN code. We will explore more about this classification technique soon!