STA 363/663 Lab 1
Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.
The Goal
We are going to be using R extensively in this course. Because of this, our lab today is designed to remind us of some of the code we will be using frequently, including code for data visualization. Most of the coding should feel familiar.
We will also practice some of the vocabulary terms we covered in our first class. These terms will be used throughout the semester, so let’s take some time to practice them now!
Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.
Clearing the Workspace
You learned R in STA 112. This means that when you open RStudio, the “data environment” (the upper right hand panel of RStudio) where all our data sets are stored may be very full! If your Environment is blank, don’t worry about it! If it’s not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.
To clear all of the data in the upper right hand panel, look at the top of the panel and find an image that looks like a small broom. Pushing that will “clean” your space, meaning that it will remove all the contents within the panel. It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.
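If you prefer code to buttons, the standard R command rm(list = ls()) does the same clean-up; a minimal sketch:

# Remove every object currently stored in the Environment
rm(list = ls())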
Installing Packages
We will access data from a variety of sources in this course, but today we will be loading data from inside an R package. A package is a collection of related code and functions. When you load a package into R, you give R access to all of the functions within the package. This means that our first step is to remind ourselves how to install an R package.
To install a package:
- Look at the top of your RStudio screen and find “Tools”.
- From the drop down menu, choose “Install Packages”.
- In the white box, type palmerpenguins and then hit install.
- Repeat the process above with the package corrplot.
You will need several packages as you move through this course, and you will use these same steps anytime you need to install a new one. Luckily, you only have to install each package once!!!
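If you ever want to skip the menus, the same installation can be done with code. Run this once in the Console (not in your Markdown file, since installing is a one-time job):

# Install the two packages used in this lab (run once)
install.packages("palmerpenguins")
install.packages("corrplot")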
Loading Packages
Now that we have the package installed, we need to tell R to actually use that package. To do that, we need to create a space in our RMarkdown file to write code. Remember, Markdown can do three things: (1) create regular text, (2) run code, and (3) create math equations. Right now, we want (2).
To tell RMarkdown we are about to give it code, there are two ways to do it.
- Option 1: Look at the top of your Markdown file, and find Code. Click it. From the drop-down menu, choose Insert Chunk. Click it! A gray box should appear in your Markdown file. This is a chunk.
- Option 2: In the top gray bar of your Markdown file, look for a small green C. Click that, and choose R! A gray box should appear in your Markdown file. This is a chunk.
When you are done, you should have a code chunk that looks like this:
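```{r}

```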
Anything we put inside a chunk (that gray box) will be treated as computer code. Go ahead and put the code below inside a chunk.
suppressMessages( library(ggplot2) )
suppressMessages( library(corrplot) )
suppressMessages( library(palmerpenguins) )

Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button (press play). This tells R to run the code!
When you run this code, it will look like nothing happens…and that’s because all we told R to do was get ready to use these packages. With such a command, nothing will show up on our screen, and that’s okay!
Note: In the code above, you'll notice I have suppressMessages around each library command. This is not strictly necessary, but if you don't include it, R prints out messages/comments about the libraries when you load them, for example: Attaching package: dplyr. This is annoying when we want to have professionally formatted files, so suppressMessages just tells R we don't want to see those.
The Data
Our data set for today contains information on n = 333 penguins from three different species. We have a client who is interested in building a model for \(Y\) = the body mass of a penguin in grams. To load the data, (1) create a chunk, (2) paste in the code below, and (3) press play.
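The chunk itself is not reproduced here; a plausible version, assuming the lab drops rows with missing values (which is what takes the palmerpenguins data from 344 rows down to the n = 333 used today), is:

# Assumed chunk: load the penguins data from palmerpenguins
data("penguins")
# Keep only complete rows, which leaves n = 333 penguins
penguins <- na.omit(penguins)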
Now that we have data, we are ready to use it to start answering our lab questions! We will use Markdown (or Quarto if you prefer) files in our course to submit labs. When you are answering a lab question, your set up should look like this:
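## Question 1

*This is where I will type my answer to Question 1.*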
The ## allows your Question numbers to show up in bold so I can easily find them for grading. The * puts your text in italics. Let's try it!
Question 1
Based on the information we have so far, how do we know that we are dealing with a supervised learning task?
Question 2
Based on the information we have so far, are we dealing with a regression or classification modeling task?
So, there are 3 vocab terms already:
- Supervised Learning
- Classification Task
- Regression Task
Starting next week, there will be a list of vocabulary terms on Canvas to show which terms we have covered so far. Hopefully, this will be helpful for studying!
In addition to the response variable, the data set contains information on 7 features:
- species - the type of penguin.
- island - the island where the penguin lives.
- bill_length_mm - the length of the penguin bill in millimeters.
- bill_depth_mm - the depth of the penguin bill in millimeters.
- flipper_length_mm - the flipper length of the penguin in millimeters.
- sex - the biological sex of the penguin.
- year - the year the penguin was measured.
Question 3
Client 1 is interested in determining the relationship between \(X\) = the flipper length and \(Y\) = the body mass of a penguin. For Client 1, are we dealing with a prediction or association task?
Question 4
Client 2 would like us to build a model that can be used to estimate the body mass of penguins once certain other characteristics, such as flipper length, are measured. For Client 2, are we dealing with a prediction or association task?
Now, we could use one model to try to suit the needs of both clients, or we could build two separate models, one designed for each task. As statisticians, it is our job to determine what is appropriate and what is possible with our resources. More on that later.
For the rest of the lab, we are working on Client 2’s request: predicting body mass.
We have just gotten started with the analysis, and we have already run through our vocab list:
- Supervised Learning
- Classification Task
- Regression Task
- Association Task
- Prediction Task
- Feature
These are key terms you will be hearing constantly as we move through the course. If you have any questions about them, please let me know!
Exploratory Data Analysis (EDA)
Once we have the data and know the goal of the analysis, the next step is to start to look at the data. We refer to the process of exploring the data as exploratory data analysis, or EDA. This means using graphs, charts, tables, and other such approaches to explore the data.
Let's start by creating a summary of our response variable. In R, to tell the computer we wish to work with a single column in a data set (like body_mass_g), we use $. So, to tell the computer "Please go into the penguins data set and look at only the body mass column", we use penguins$body_mass_g. This means that to get a summary of that single column, we use:
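# Numeric summary (min, quartiles, mean, max) of body mass
summary(penguins$body_mass_g)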
Question 5
Based on the summary, what is the smallest body mass in the data set?
Question 6
Use a summary to find out which species of penguin is most popular in the data set. Show your summary and state the most popular species as the answer to this question.
Exploring the response variable and features is a really important step before we do any modeling. We need to make sure we don’t have any outliers that might be a problem, or missing data to contend with, etc., before we start to do any modeling.
As we discussed in class, there are three things that we always need to check before using features for modeling.
- We need to exclude any non-numeric columns that are unique for every row - such columns cannot be used as features.
- For any categorical features, we need to check to make sure there is enough data in each level for modeling. Ideally, we want about 10% of the data in each level of the categorical features.
- We need to check for multicollinearity in our features.
Question 7
Check 1) and 2). Briefly describe how you check them and state whether you find any issues.
Multicollinearity: Numeric Features
The next step is to check 3) multicollinearity. Checking multicollinearity means checking for features that are perfectly or strongly related to one another.
If there are two perfectly related features, we need to remove one feature before we build a predictive model - or any kind of model for that matter! If you put two perfectly correlated features into the model, you are inputting the same information twice, and the model can only use it once. For example, we might have weight in pounds and weight in kgs. Different numbers, but they mean the same thing when converted!
What about if features are strongly related to one another, but they are not perfectly related? As we move through STA 363, we will learn various models and algorithms that are equipped to handle such strongly (but not perfectly!) related features. However, right now we are dealing with linear regression models. For linear regression models, if two features are strongly related, we usually remove one.
To see why, consider a data set with \(n = 100\) rows and 2 features, \(X\) and \(Z\), where \(X\) and \(Z\) are highly correlated.
\[Cor(X, Z) = 0.9998081\]
In this data set, there is a strong positive relationship between \(X\) and the response variable \(Y\).
\[Cor(X, Y) = 0.953206\]
There is also a strong positive relationship between \(Z\) and \(Y\).
\[Cor(Z,Y) = 0.9527016\]
However, if we put both \(X\) and \(Z\) into a model, look at what happens:
|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | -0.523   | 0.979      | -0.534  | 0.595      |
| x           | -1.383   | 1.051      | -1.316  | 0.191      |
| z           | 0.199    | 0.105      | 1.892   | 0.061      |
The model suggests that the relationship between \(X\) and \(Y\) is negative, even though it is not!!
This is because when two strongly related features are put into the model, we are double counting the effect of that feature in the model. This means that typically, one of the two coefficients in the model flips signs (in this case becoming negative) to offset that double counting. This leads to coefficients that do not make sense when interpreting, and it can lead to less accurate predictions when using the model.
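To see this for yourself, here is a small simulation sketch; the numbers are made up (they will not exactly reproduce the table above, and the estimates change with the random seed), but the instability is typical:

# Simulate two nearly identical features and a response related to both
set.seed(363)
x <- rnorm(100)
z <- x + rnorm(100, sd = 0.02)   # z is almost a perfect copy of x
y <- 2 * x + rnorm(100, sd = 0.65)

cor(x, z)   # extremely close to 1
cor(x, y)   # strong and positive
cor(z, y)   # strong and positive

# With both features in the model, the individual coefficient estimates
# become very unstable, and one of them can flip sign
summary( lm(y ~ x + z) )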
One way to check for multicollinearity in numeric features is to use a plot called a correlogram. To create such a plot, we need the following code, where data_with_only_numeric_features is a stand-in name:
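# Placeholder: replace with only the numeric columns of your data set
corrplot( cor( data_with_only_numeric_features ) )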
Here, we would replace data_with_only_numeric_features with only the numeric columns in our data set. For instance, if columns 1, 2, and 5 are numeric in a data set called data, we would use:
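# Correlogram built from columns 1, 2, and 5 only
corrplot( cor( data[ , c(1, 2, 5)] ) )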
Question 8
Which columns in penguins are numeric?
Question 9
Create and show the correlogram for penguins. Do we have any issues with multicollinearity with any numeric features for these data? In other words, do we have any numeric features that are strongly correlated with one another? Briefly explain/justify your answer.
Multicollinearity: Categorical Features
Correlograms allow us to check for linear multicollinearity with numeric features, but our data set also contains categorical features. We cannot compute a correlation coefficient for categorical features, but it is still important to check for relationships among them. The easiest way to do this is by using tables.
There are several ways to make tables in R, but we will discuss two.
The first is very direct. We tell R we want to use the penguins data set and the variable species by using the code penguins$species (dataset$variable). Then, we use the table(whatWeWantToMakeATableWith) command to actually make the table:
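# One-way table counting how many penguins of each species we have
table(penguins$species)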
However, this makes a table that is not particularly pretty or professional when you knit. A second option that does create professional tables is:
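The original chunk is not shown here, but a plausible version, assuming the knitr::kable function (which matches the "formats nicely when you knit" behavior described below), is:

# Assumption: using knitr::kable to format the counts table
knitr::kable( table(penguins$species),
              col.names = c("Species", "Count") )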
The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit. See how nicely the table gets formatted?
Question 10
Create a table, using the second way to make a table, for the island where the penguins live. Label the columns appropriately.
Okay, so now we can summarize the data by looking at the islands and the species. However, the goal was to look at them together to check for a relationship between the variables. To do this, we create a table with two categorical variables.
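The chunk is not reproduced here, but the direct version of such a table would be:

# Two-way table of species by island
table(penguins$species, penguins$island)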
Question 11
The table above shows that species and island are related. In 1 sentence, explain to your scientist client how you can tell this by looking at the table.
Question 12
Because the two features are related, our client wants to know if we should (a) include the feature island in the model, (b) include the feature species in the model, or (c) include both features. What would you recommend and why?
Question 13
Check to see if there are multicollinearity issues with
- sex and species
- sex and island
If there is no multicollinearity, just state that there are no multicollinearity concerns with these features. If there are multicollinearity concerns, tell your client how you would suggest handling them.
As we use different methods and models in this course, we will see that some are equipped to handle correlated features and some are not. This is part of why EDA is so important before we choose a model. We need to know what traits our data set has in order to choose a model that might be appropriate!
Training a Model
Now that we have checked our features, let’s train a regression model to predict \(Y\) = body mass (in grams) using all the features (excluding any you think we need to remove to deal with multicollinearity).
As a reminder from STA 112, we can build a regression model using:
model <- lm( body_mass_g ~ species + island + bill_length_mm + bill_depth_mm +
    flipper_length_mm + sex + year, data = penguins)

The lm tells R we are building a linear regression model. To see the coefficients, we use:
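A plausible version of that chunk (assuming the lab formats the coefficient table with knitr::kable, as it does for its other tables) is:

# Assumption: formatting the coefficient table with knitr::kable
knitr::kable( summary(model)$coefficients )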
Question 14
Adapt the code above to remove any features you suggested removing during the multicollinearity check. Show the table of coefficients (formatted using the code above) as your answer to this question.
In STA 112, we would now start looking at the coefficients and interpreting them. However, today and almost always in STA 363, our goal is prediction. This means we want to use the features in the data to predict \(Y\), which in this case is body mass of a penguin.
To use our model to make predictions, we would take the traits of each penguin and plug them into the model we just trained. We can do this for all the rows in penguins at once using predict:
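# Store the predicted body mass (in grams) for every row in the data set
yhat <- predict(model)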
You will notice here that I have used an arrow (<-). In code, this structure means:

Box <- what you store in the box

In other words, an arrow means put whatever is on the right into an object named on the left. In our case, we just stored our predictions under the name yhat. That's going to be very important now, because we are going to use these predictions.
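If you want to peek at the first few stored predictions, you can run:

# Show the first six predicted body masses
head(yhat)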
Assessing Prediction Accuracy: RMSE
The final step in our prediction process is to assess the accuracy of our predictions. Let \(y_i\) be the true value of body mass for row \(i\) in the data set. Let \(\hat{y}_i\) be the value we predicted for body mass for row \(i\) in the data set using our model. We want to see how close \(y_i\) is to \(\hat{y}_i\).
We have \(n= 333\) rows in our data set. This means that ideally, we would like some metric (numeric tool) that allows us to summarize how well we are predicting for all rows. A common metric we use for this is called the root mean squared error (RMSE).
\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]
Hmm, this looks like a lot. Let’s break it down.
Part 1
\[y_i - \hat{y}_i\] This is the distance between each true value of body mass \(y_i\) and the predicted value \(\hat{y}_i\). In other words, it’s the residual for row \(i\)!
Part 2
\[(y_i - \hat{y}_i)^2\]
We square the values to get rid of any negative numbers; we don’t want a large negative value and a large positive value to cancel each other out, as each would represent a large difference between the truth and the prediction.
Part 3
\[\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] Hey, this should look familiar! It looks just like the residual sum of squares (RSS), which is the sum of the squared residuals.
Part 4
\[MSE = {\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\] The next step is to divide the RSS by the number of rows \(n\). This gives us the mean squared error (MSE), which is the average squared prediction error, meaning the average squared distance between the true value and the prediction. We don’t want the squared difference, though. So…
Part 5
\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]
We take the square root of the MSE to get the root mean squared error (RMSE). This gives us a measure that describes, on average, how far away the predictions are from the truth.
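As a quick made-up example with \(n = 3\) residuals of 2, -1, and 3:

\[RMSE = \sqrt{\frac{1}{3}\left(2^2 + (-1)^2 + 3^2\right)} = \sqrt{\frac{14}{3}} \approx 2.16\]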
Question 15
Based on the definition of the RMSE, if we are using RMSE to assess predictive accuracy, do larger or smaller values of the RMSE indicate more accurate predictions?
As we move through the semester, we will learn several different metrics we can use to assess the performance of a model. Each will have a different job and a different interpretation, so it is important to think through how the metric is built so we understand what it means.
Let's build a quick function to compute the RMSE so we don't have to do it by hand. Some of it is filled in, but you need to fill in the rest!
compute_RMSE <- function(truth, predictions){
  # Part 1: the residuals
  part1 <- truth - predictions
  # Part 2: square the residuals
  part2 <- 
  # Part 3: RSS
  part3 <- 
  # Part 4: MSE
  part4 <- 1/length(truth)
  # Part 5: RMSE
  sqrt( )
}

To use the function, you will run compute_RMSE(truth, predictions), where you replace truth with the true values of body mass in the penguins data set (hint: we have learned how to pull just one column in a data set in this lab!) and predictions with the predicted body masses (hint: we just made these predictions!! Where did we store them?).
Question 16
Complete the function above (let Dr. Dalzell know if you get stuck!!) and use it to compute the RMSE for predictions for body mass in our penguins data.
Question 17
Interpret the RMSE from the previous question in the context of our data set.
Assessing Prediction Accuracy: Plot
In addition to the RMSE, it is also a good idea to look at a graph of the predictions \(\hat{Y}\) versus the true values of \(Y\). This allows us to see where we are predicting well, if there seem to be regions we are not predicting well, etc.
Question 18
We have two numeric variables: predicted body mass and true body mass. What type of plot could we use to examine the relationship between these two?
To create the plot we need, paste the following code in a chunk and hit play.
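ggplot(penguins, aes(x = yhat, y = body_mass_g)) +
  geom_point(col = 'blue')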
The creation of plots in ggplot2 requires building the plot in layers. First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code. Once we have built the background, we are ready to plot our data.

The command we will use for this depends upon the data type we are working with. In this case, the command we use is geom_point(). We specify that we want the dots to be colored ('col') blue.
Notice that in the code, we use a plus sign to add each layer to the graph. We add the background AND THEN the dots to make the final graph.
Question 19
Change the color of the points on the graph to purple and show the resulting plot.
We can add on another layer to our plot by adding x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.

ggplot(penguins, aes(x = yhat, y = body_mass_g)) +
  geom_point(col = 'blue') +
  labs(title = "Figure 1:", x = "x axis label",
       y = "y axis label", caption = "caption")

This ggplot syntax actually mimics the way humans would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add the labels. Thinking through the steps in this manner will help you understand the syntax of this package.
Question 20
Adapt the code above to add the following labels:
- X axis: Predicted Body Mass
- Y axis: True Body Mass
- Caption: Comparing predicted and true penguin body mass (in grams)
One VERY important thing to remember when we make plots is to make sure the axes are easily interpretable by your reader. You do not want to use default variable names, like “body_mass_g”. Instead, we want clear labels like “Body Mass (in grams)”. This is going to be a requirement for all graphs you make in this course - label your axes appropriately and title your graphs.
The very last step is to add on a 0-1 line. This is a line with a slope of 1 and an intercept of 0. Any points on the line represent a perfect prediction.
ggplot(penguins, aes(x = yhat, y = body_mass_g)) +
  geom_point(col = 'blue') +
  labs(title = "Figure 1:", x = "x axis label",
       y = "y axis label", caption = "caption") +
  geom_abline(intercept = 0, slope = 1, col = "black")

Question 21
Adapt your code from Question 20 to add the 0-1 line, but make the line any color except black, white, or yellow.
Now that we have our graph, let’s think about what information it yields.
Question 22
If a point on the graph is above the 0-1 line, what does that tell us?
Question 23
If a point on the graph is below the 0-1 line, what does that tell us?
Question 24
Does it look like there are any true values of body mass that we predict better or worse for than others? For example, do we seem to predict better for lighter penguins, or for heavier penguins, or does the predictive ability of the model seem about the same regardless of the penguin mass?
Before you submit
Almost done!! However, there is one more thing we need to do before we knit. Look at the very first chunk in your Markdown file. You should see something like:
knitr::opts_chunk$set(echo = TRUE)
Change this to:
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
What this will do is hide all the code you have created from your final document. All your plots will still show up, but your code will not.
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly!!!
References
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated December 17, 2025.
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.