STA 363 Lab 3

Complete all Questions and submit your final PDF or HTML file under Assignments in Canvas.

The Goal

Today, we are going to return to our penguin data from Lab 1. Our goal is to practice creating test data from training data.

The Data

To load the data, put the following two lines of code inside a chunk and press play:

library(palmerpenguins)
data("penguins")

As a reminder, this data set contains information on n = 344 penguins. However, there is some missing data. We will begin by removing the rows with missing data.

penguins <- na.omit(penguins)
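To see what na.omit() is doing, here is a minimal sketch on a tiny made-up data set. The names toy and toyComplete are hypothetical stand-ins for illustration, not part of the lab data. Any row that contains at least one NA is dropped.

```r
# Hypothetical stand-in data frame; not part of the lab data
toy <- data.frame(
  mass = c(3750, NA, 3250, 4100),
  sex  = c("male", "female", NA, "female")
)

nrow(toy)                    # 4 rows before removing missing data

toyComplete <- na.omit(toy)  # drop every row that contains an NA
nrow(toyComplete)            # 2 rows remain (rows 2 and 3 each had an NA)
```

In the same way, running nrow(penguins) before and after the na.omit() line is a quick way to see the row count go from 344 to 333.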

Our goal for today is to predict the sex of a penguin based on features of that penguin. Birds in particular are very difficult to distinguish in terms of sex, especially when the birds are young, so such a predictive model can be very useful to ecologists.

Now, as we discussed in class, we typically are presented with two different data sets when our goal is prediction. The first, the training data set, is used to conduct EDA and to train our model. The second, the test data set, is used to test the accuracy of our model.

Here, we are only provided with one data set. As we have discussed in class, there are three different ways we can “create” test data. We are going to try one of these in our lab today. Specifically, we are going to manually divide the data into two data sets: a test data set and a training data set.

Note: This technique is NOT something we would do in real life with a data set this small.

Question 1

What problems occur when we split the training data set into two data sets, especially with a data set this small?

Why are we doing it, then?? Well, it is easier to learn coding techniques when the data set we are working with is smaller. This means that this is a good data set to practice on, even if in reality we would not want to use this technique.

Using Random Seeds

The first step in the process of splitting a data set into test and training data is to determine which rows will go in the training data, and which will go in the test data. We will use random sampling to help us to do this. Put the following code in a chunk and press play.

set.seed(100)
trainRows <- sample(1:333, 200)

The code chunk above has two lines. The first line of code, set.seed(100), is one we will use quite a bit in this class. We are going to use random sampling to choose the rows for our training data. This means that we are going to ask the computer to randomly sample 200 of our 333 rows. However, suppose you close your Markdown file and come back to it later. We want the computer to choose the SAME 200 rows when you run your code again. Otherwise, you would get completely different results each time you drew a random sample! The set.seed() function is what ensures that each time you run your chunk, you will get the same random sample. Let’s try that.

Question 2

Create a code chunk. Use the code sample(1:10, 2) to print out 2 random numbers between 1 and 10. Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now?

You should notice that every time you hit play on this chunk, you get a different sample. This is what would happen if you closed your Markdown and re-opened it, or if you gave your code to someone else to run. This is not something we want to have happen, so we set a random seed to fix this problem.

Question 3

Now, add the line set.seed(435) to the beginning of your code chunk from the previous question (meaning this line needs to come before the sample command.) You will note that I used 435, but you can use literally any positive integer you want as your random seed. Hit play on the chunk. What numbers do you get? Now, hit play again. What numbers do you have now?

Now we notice that no matter how often we run the chunk, we get the same values. Yes!! This means that we can close R and come back to it later, and our results will not change. This also means that we can send our code to someone else, and they will get the same random sample that we did. In other words, setting a seed helps make your code reproducible.
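If you would rather have R confirm this than compare the numbers by eye, one option is sketched below. The names firstDraw and secondDraw are hypothetical, chosen only for this illustration. The key detail is that the seed is set again right before each draw.

```r
# Draw the same sample twice, resetting the seed before each draw
set.seed(435)
firstDraw <- sample(1:10, 2)

set.seed(435)
secondDraw <- sample(1:10, 2)

# TRUE: same seed, same call, same result
identical(firstDraw, secondDraw)
```

Within a chunk, set.seed() and sample() always run together, which is why re-running the whole chunk gives the same sample every time.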

Setting a random seed (which is what we will call using the set.seed() command) will prove very useful for any kind of random sampling we do in this course.

Sampling in R

Now, let’s go back to the code chunk we started with:

set.seed(100)
trainRows <- sample(1:333, 200)

We have now explored the first line, but what does the second line do? Well, sample(1:333, 200) will draw 200 random numbers from the whole numbers 1 through 333 (1:333). By default, sample() draws without replacement, so no number is chosen twice.
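A short sketch of how you might check what sample() gave you. The name exampleRows is hypothetical and is used here so that nothing you create in the lab gets overwritten.

```r
set.seed(100)
exampleRows <- sample(1:333, 200)   # exampleRows is a hypothetical name

length(exampleRows)          # 200 values were drawn
anyDuplicated(exampleRows)   # 0 means no value appears twice
range(exampleRows)           # smallest and largest values drawn
```

Because no value repeats, each number can safely be used as the row number of a different penguin.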

Question 4

Set a random seed of 367. Show the code you would need to sample 5 random numbers between 5 and 678.

Okay, back to our code chunk. Once we have drawn our 200 random numbers, we store that output in a vector called trainRows. The <- operator is what stores the result. Where does it store it? Well, look in the Environment pane in the upper right corner of your RStudio screen. You will see a vector called trainRows.

set.seed(100)
trainRows <- sample(1:333, 200)

Question 5

Set a random seed of 245. Show the code you would need to sample 5 random numbers between 5 and 200, and store the results as trainPractice. Which 5 random numbers did you select?

Okay, so right now trainRows is a vector that contains 200 numbers between 1 and 333. These are the rows we want to grab from the penguins data and use as our training data. To do this, we use:

trainPenguins <- penguins[trainRows,]

Again, you will notice the use of the storage operator <-, but this time we are not storing a vector. We are instead storing a data frame (a data set). We have reached into the penguins data and pulled all the rows that we selected in trainRows. We used these rows to create a new (smaller) data set called trainPenguins which should have 200 rows. Note that [trainRows, ] indicates that we want the 200 rows in trainRows, and we want all the columns. If we wanted only column 1, for instance, we would use [trainRows, 1].
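The [rows, columns] pattern is easier to see on a tiny example. This is a sketch on a made-up data frame; toy and keepRows are hypothetical names, not objects from the lab.

```r
# Hypothetical stand-in data frame with 5 rows and 2 columns
toy <- data.frame(
  id   = 1:5,
  mass = c(3750, 3800, 3250, 3450, 3650)
)
keepRows <- c(2, 5)      # row numbers to keep (hypothetical)

toy[keepRows, ]          # rows 2 and 5, all columns
toy[keepRows, 1]         # rows 2 and 5, column 1 only: 2 5
toy[keepRows, "mass"]    # columns can also be chosen by name: 3800 3650
```

One caution: toy is a plain data frame, so picking a single column returns a vector. The penguins data is stored as a tibble, which should instead return a one-column data set when you use [trainRows, 1].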

Question 6

Using the rows indicated in the trainPractice vector you created in Question 5, create a data frame (data set) called trainPracticeData by selecting only the rows from the penguins data that are indicated in trainPractice. Print out the result by typing trainPracticeData in a chunk and pressing play.

Okay, now we have our training data. This is 200 of the original 333 rows. What do we do with the 133 rows that were not chosen for training data? The rows that were not selected for our training data will be in our test data set. To choose all the rows that were not in trainRows, we use:

testPenguins <- penguins[-trainRows,]

The - operator means “not” or “except”; grab all the rows from the penguins data set EXCEPT those we already put in the training data.
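Here is a minimal sketch of the minus sign at work on a made-up data frame. The names toy, chosen, toyTrain, and toyTest are hypothetical and exist only for this illustration.

```r
# Hypothetical stand-in data frame with 6 rows
toy <- data.frame(
  id   = 1:6,
  mass = c(3750, 3800, 3250, 3450, 3650, 3625)
)
chosen <- c(2, 4, 5)         # rows picked for the "training" piece

toyTrain <- toy[chosen, ]    # rows 2, 4, and 5
toyTest  <- toy[-chosen, ]   # every row EXCEPT 2, 4, and 5

toyTest$id                   # 1 3 6
```

Every row lands in exactly one of the two pieces, so the row counts of toyTrain and toyTest add up to the row count of toy.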

Check to make sure trainPenguins has 200 rows and testPenguins has 133 rows.
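One way to run this kind of check is sketched below. So that the code runs on its own, it uses a made-up data set with 333 rows; standIn, rows, trainSet, and testSet are hypothetical names. In your lab, you would apply nrow() to trainPenguins and testPenguins instead.

```r
# Stand-in with 333 rows so the sketch runs on its own;
# in the lab, use the penguins data instead
standIn <- data.frame(
  id    = 1:333,
  group = rep(c("a", "b", "c"), times = 111)
)

set.seed(100)
rows <- sample(1:333, 200)

trainSet <- standIn[rows, ]
testSet  <- standIn[-rows, ]

nrow(trainSet)                               # 200
nrow(testSet)                                # 133
length(intersect(trainSet$id, testSet$id))   # 0: no row is in both sets
```

The last line is an extra check that the training and test data do not share any rows.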

Now that we have our test data and training data, we have created a few objects we no longer need. To clean up our work space in R, we can remove these objects if we wish. To do so, we use the rm() function:

rm(trainRows,trainPractice,trainPracticeData)
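If you want to confirm that rm() worked, a small sketch using a throwaway object is below. The name scratch is hypothetical.

```r
scratch <- 1:3        # hypothetical throwaway object
exists("scratch")     # TRUE: the object is in the work space

rm(scratch)
exists("scratch")     # FALSE: the object has been removed
```

Note that rm() gives a warning for any object it cannot find, so if you skipped Question 5 or Question 6, expect a warning about the missing objects.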

Make sure you ask if you have any questions about this lab!!! You will be using all of these coding concepts to do your project, so it is important that these ideas make sense to you before you start your project.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 September 5.

The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.