STA 112 Lab 3

Complete all Questions and submit your final PDF or HTML (either works!) under Assignments in Canvas.

The Goal

Last lab, we focused on learning to fit an LSLR line (i.e., find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$) using R. We looked specifically at coffee data, and went through the steps of building and interpreting the LSLR line.

It turns out that the choice to use the LSLR model involves some assumptions. We have already discussed the fact that we really shouldn’t use a linear model if the relationship between X and Y isn’t, well, linear. In other words, in a graph, the relationship between X and Y should look like a line. This is called the form (shape) condition, and it is one of the first things we check when we choose to use LSLR.

Today, we are going to explore alternative modeling choices when the shape of the relationship is not a line.

Remember that now is a good time to clear your Environment if you have not already! We created a lot of objects in the last lab, and today’s work will be easier if your Environment is clean.

The Data

The data we are working with today is on bluegills, a type of fish that lives in fresh water throughout North America. We have a client who has provided us with data on $n = 170$ bluegill fish. The client wants to try and understand how the length of the fish changes with age. This information is important because bluegills are popular fish to catch for sport fishing.

Our client is in charge of determining how big a fish has to be in order for someone who catches a bluegill to keep the fish rather than having to put it back in the water. This required minimum length is called the length minimum. In other words, a fish that is longer than the length minimum can be kept by the individual who caught the fish, but any fish with a length below the length minimum must be put back in the water.

The client knows that a bluegill reproduces the most during the first 4 years of life, so it is important that fish who are 4 years old or younger are not removed from the water. This helps ensure the species will continue to thrive in the wild.

You have been asked by the client (1) to determine what the length minimum should be and (2) to describe the relationship between X = fish age and Y = fish length.

Download the data

The data that you need can be downloaded from Canvas. Once you have downloaded the data, you need to move it into R.

This can be done using the same steps as Lab 2. I’ve put a brief reminder of those steps below.

Step 1 Look at the upper right hand panel of your RStudio screen (the Environment tab).
Step 2 Find “Import Dataset” or “Import” and click on it.
Step 3 Choose “Text (base)” or “From CSV” (it will depend on your computer).
Step 4 Find your data (bluegills.csv) in the list that comes up. Choose it!
Step 5 Now, look at the bottom right hand panel of your screen. You should see a line of code with something like bluegills <- read.csv("SOME STUFF HERE") or bluegills <- read_csv("SOME STUFF HERE").
Step 6 Copy that ENTIRE line of code.
Step 7 In your Markdown file, insert a code chunk.
Step 8 Paste the line from Step 5 into this gray code chunk, and press the green arrow (the play button).

And you are ready to go!! You now have the data loaded and you can work with it!
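Once the data is loaded, it can help to confirm that R read it the way you expect. The sketch below is only an illustration: it writes a tiny made-up file (the columns week and height and the name example_data are hypothetical, not part of the bluegills data) and then reads it back in with read.csv, so it runs on any computer. With your own data, you would apply dim, head, and str to bluegills instead.

```r
# Write a tiny made-up CSV to a temporary file so this sketch runs anywhere.
# (Illustration only: these values are NOT the bluegills data.)
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(week = c(1, 2, 3), height = c(2.1, 3.9, 6.2)),
  tmp,
  row.names = FALSE
)

# Read the file into R, just like the import code from Step 5 does.
example_data <- read.csv(tmp)

# Quick checks: number of rows and columns, first few rows, column types.
dim(example_data)
head(example_data)
str(example_data)
```

For the bluegills data, you would expect dim to report 170 rows, since the client gave us $n = 170$ fish.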

We are going to do one more formatting step today to get you ready for your project. Look at the first chunk in your Markdown. You will see:

knitr::opts_chunk$set(echo = TRUE)

Change it to

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

What this does is hide your code when you knit your document. Your code will still run, and all your results will show up, but things like lm(y~x) will not show up. This allows us to take our Markdown files and create professional documents!
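If you want to double check that the new setting took effect, one option (a small sketch, assuming the knitr package is installed, which it is whenever you can knit) is to ask knitr for the current default after running the setup chunk:

```r
# Setup chunk: hide code, messages, and warnings in the knitted document.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

# Ask knitr what the current default for echo is.
knitr::opts_chunk$get("echo")
```

If the result is FALSE, your code will be hidden in the knitted document while the output still appears.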

EDA

The first step in starting any analysis is exploratory data analysis (EDA), which generally means making plots to explore the data. We have to really take time to understand what is going on in the data. For this data set, we only have two columns, so this process will focus on those two variables. The length of the fish in millimeters is our response (Y) variable.

Question 1

Use one visualization (i.e., a plot you think is appropriate) to explore the relationship between the X = age of a fish (in years) and Y = the length (in millimeters). Make sure you have labeled your axes “Fish Age in years” and “Fish Length in mm”, and title your graph Figure 1.

Question 2

Describe the relationship between the two variables.

We have a specific goal when working with these data. How long do we expect fish to be if they are more than 4 years old? In other words, given the age of the fish (X), can we predict the length (Y)? We hope to then use this value to determine the length minimum.

Since we know we are interested in looking at a particular point on the x axis (Age = 4), it can be helpful to add a vertical line to our scatter plot at Age = 4. Since we are adding a vertical line, the code we will use is + geom_vline(xintercept = ) at the end of our scatter plot from Question 1. Right after the = sign in xintercept =, type the value on the x axis that you want the vertical line (vline) to go through.
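To see how geom_vline behaves before using it on the fish data, here is a minimal sketch on a small made-up data set. The data frame plants, its columns week and height, and the choice of week = 3 are all hypothetical and only there for illustration. The sketch assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Made-up data for illustration only (NOT the bluegills data).
plants <- data.frame(
  week   = c(1, 2, 3, 4, 5, 6, 7, 8),
  height = c(2.1, 3.9, 6.2, 7.8, 9.1, 9.9, 10.4, 10.6)
)

# Scatter plot with a vertical reference line at week = 3.
p <- ggplot(plants, aes(x = week, y = height)) +
  geom_point() +
  geom_vline(xintercept = 3) +
  labs(x = "Week", y = "Height in cm", title = "Example: vertical reference line")

# Print the plot.
p
```

The vertical line makes it easy to see which points fall to the left and to the right of the value you care about.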

Question 3

Create the same graph as you did in Question 1, but now add a vertical line to your graph at Age = 4. Make sure you have labeled your axes “Fish Age” and “Fish Length”, and title your graph Figure 2. Choose to either (1) change the color(col) of the line or (2) make the line dashed by adding lty = 2. For option 2, your code will look like geom_vline(xintercept = , lty = 2)

Note: You can use numbers other than 2 to choose a different line type. You can also change the thickness of the line by adding linewidth = 1.5 or some other value greater than 1 (in versions of ggplot2 older than 3.4.0, use size = 1.5 instead).

Question 4

Based strictly on the graph, what length might you choose as the length minimum? Explain your choice. Note: There are many “right” answers here, the point is to see your thinking!

Question 5

Create the same graph as you did in Question 3, but now add a horizontal line to your graph at the length minimum you have suggested. Make sure you have labeled your axes “Fish Age in years” and “Fish Length in mm”, and title your graph Figure 3.

Hint: We added a vertical line with geom_vline, where the v is for vertical. We then specified where on the x axis we wanted the line to hit with xintercept =. Now you want a horizontal line and a y intercept.

By looking at our graph, we can take a guess at the length minimum. However, to get a more exact answer, we want to build a statistical model for the relationship between fish age (X) and fish length (Y).

Question 6

Based on what we have so far, does it look like the shape of the relationship indicates that LSLR might be appropriate?

Considering Other Options

Luckily, LSLR is not the only type of regression model we can build. There are a lot of other shapes we can use other than a line!

Question 7

Based on what we have so far, what might be a more appropriate regression model to use, rather than LSLR?

In the question above, you have proposed a possible regression model choice. This choice involves a shape that is non-linear. When we are considering using a specific shape for the model, it is often helpful to graph the shape so that we can see if the choice we have made is reasonable for the data we have.

If we want to add the LSLR line to a scatter plot, we add +stat_smooth( formula = y ~ x, method = "lm", se= FALSE), as we discussed in the last lab.

If we want to add on a curve for something like polynomial regression, we instead add +stat_smooth( formula = y ~ poly(x, NUMBER), method = "lm", se= FALSE), where we replace NUMBER with the degree of the polynomial we want to use.
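Here is a rough sketch of how the two stat_smooth options compare on simulated data. Everything in it is an assumption made for illustration: the data frame sim, the curved pattern used to simulate y, and the degree of 2 are made up and say nothing about which shape is right for the bluegills. The sketch assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Simulate made-up data with a curved pattern (illustration only).
set.seed(112)
sim <- data.frame(x = seq(1, 10, length.out = 40))
sim$y <- 5 + 4 * sim$x - 0.3 * sim$x^2 + rnorm(40, sd = 1)

# Overlay a straight line (red) and a degree 2 polynomial curve (blue).
p <- ggplot(sim, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = "lm", se = FALSE, col = "red") +
  stat_smooth(formula = y ~ poly(x, 2), method = "lm", se = FALSE, col = "blue") +
  labs(x = "x", y = "y", title = "Example: line versus polynomial curve")

# Print the plot.
p
```

Putting both shapes on the same plot is one way to judge visually which one follows the points more closely.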

Question 8

Create the same graph as you did in Question 1, but add on the fitted model you have proposed in the previous question. Make sure to label your axes, and title your graph Figure 4. Does this shape seem (visually) like a reasonable choice? Hint: If you are stuck, look back at the slides from last class.

Okay, so now we have decided on the type of regression model we want to use! The next step is to build the model. This will allow us to predict how long we expect a fish to be at age 4 so we can answer our client’s question.

However, before we build the model, there is one important skill in R we need to practice. This skill will be incredibly helpful for using our output when we build models.

Storing in R

When we last built a regression model in R, we used a code like this:

lm( y ~ x, data = dataset )

When we ran this code, the estimates of $\hat{\beta}_0$ and $\hat{\beta}_1$ appeared. While this is very useful, it turns out that there is more information we can get from building a regression model in R. We can get things like the $R^2$, the residuals, and a lot of other things! However, getting access to all of this easily involves storing the results of our modeling.

In R, storing means that we assign a piece of output to a name. For example, suppose we want R to assign the number 4 to the name z. We can do that with this code:

z <- 4

In this code, we have given R an object (4) and we have told it to assign that object (<-) to the letter z. In other words, we have just told R that $z = 4$. If you now create a chunk and type z, R will print out 4!

## [1] 4
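Once a value is stored, you can use its name anywhere you would use the value itself. A quick sketch, using a hypothetical name w that is not used anywhere else in this lab:

```r
# Store the number 10 under the name w.
w <- 10

# Use the stored value in a calculation: 3 * 10 + 1.
3 * w + 1   # prints [1] 31

# Storing a new value under the same name replaces the old one.
w <- 2
3 * w + 1   # prints [1] 7
```

Notice that R always uses whatever value is currently stored under the name.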

Question 9

If you were to type 4*z + 2 into a chunk, what output should you get when you run the chunk (when you press play)?

Question 10

Assign the number 6 to the letter m. Show the output you get when you run 5*m + 4 in a chunk.

You can store more complicated objects than just numbers. Let’s try it out by building an LSLR model for fish age (X) and fish length (Y). Yes, I know the relationship is not really linear!! You will be adapting this code to reflect the actual model in just a minute.

To create an LSLR model, we run the code:

lm( length ~ age , data = bluegills)

To store that model output under the name LSLRmodel, we use the following:

LSLRmodel <- lm( length  ~ age , data = bluegills)

If you create a chunk, type in LSLRmodel and press play, you will see the estimates of $\hat{\beta}_0$ and $\hat{\beta}_1$. Wait…didn’t we get that before?? So what have we gained?

Well, when you run lm in R, it does a lot more than just compute $\hat{\beta}_0$ and $\hat{\beta}_1$. It also stores the residuals, which we can retrieve using $:

LSLRmodel$residuals

So, we can now reach into our object (LSLRmodel) and retrieve ($) the residuals (residuals).

Question 11

What is the residual for the first row in the data set?

We can also obtain the values of $\hat{Y}$ for every row in the data set, meaning we can find the y-coordinate of the LSLR line for every row in the data set, by using:

LSLRmodel$fitted.values
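To see how the stored pieces fit together, here is a small sketch using cars, a data set that comes built into R (it is not our fish data, and the name example_model is made up for this illustration). It lists everything stored in the model object and checks that each residual is the observed value minus the fitted value.

```r
# Fit and store an LSLR model on R's built-in cars data (illustration only).
example_model <- lm(dist ~ speed, data = cars)

# List the names of everything stored inside the model object.
names(example_model)

# A residual is the observed Y minus the fitted value (Y hat).
by_hand <- cars$dist - example_model$fitted.values

# Check that this matches what R stored (should print TRUE).
all.equal(unname(by_hand), unname(example_model$residuals))
```

Anything that shows up in names(example_model) can be retrieved with the $ sign.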

Question 12

What is the value of $\hat{y}_1$ for the first row in the data set?

Note: If you have the code LSLRmodel$residuals and LSLRmodel$fitted.values in chunks in your Markdown file, that means that right now your final lab will contain 340 numbers printed out. We do NOT want that! To hide long output, we put a # in front of the code. This turns the line into a comment, so R does not run it and nothing is printed. That means we don’t end up with long strings of numbers in our final document!

#LSLRmodel$residuals
#LSLRmodel$fitted.values
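Another option, if you only need one or two of the values, is to ask R for just those instead of the whole list. A minimal sketch, again using R's built-in cars data and the made-up name example_model rather than the fish data:

```r
# Fit and store an LSLR model on R's built-in cars data (illustration only).
example_model <- lm(dist ~ speed, data = cars)

# Print only the first 3 residuals instead of all of them.
first_three <- head(example_model$residuals, 3)
first_three

# Print the residual for one specific row (here, row 5) using square brackets.
example_model$residuals[5]
```

This keeps the output short enough to leave in your final document.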

Creating the actual regression model

In the previous section, we built an LSLR model just as an example so we could see how to create and store objects. Now, let’s build the actual model that we think is appropriate.

Question 13

Look back at Question 8, when you chose a shape that you think is appropriate to represent the relationship between X and Y. This shape should correspond to one of the types of regression models we have learned in class.

Build your chosen model in R, and store it under the name fishmodel. As your answer to this question, write out the fitted model (that means including the numeric estimates for any $\hat{\beta}$ values).

Hint: Using the code fishmodel$coefficients is one way to see the $\hat{\beta}$ values.
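One thing to watch for if your chosen model uses poly(): by default, poly(x, NUMBER) inside lm uses rescaled (orthogonal) versions of $x$, $x^2$, and so on, so the coefficients R prints are not the ones that go with plain $x$ and $x^2$ in a written-out model. Adding raw = TRUE, or writing the powers out with I(), gives the coefficients on the original scale. The sketch below shows this on simulated data. The data frame sim, the degree of 2, and the model names are assumptions for illustration only and are not the fish model.

```r
# Simulate made-up data with a curved pattern (illustration only).
set.seed(112)
sim <- data.frame(x = seq(1, 10, length.out = 40))
sim$y <- 5 + 4 * sim$x - 0.3 * sim$x^2 + rnorm(40, sd = 1)

# Three ways to fit the same degree 2 polynomial.
orth_fit <- lm(y ~ poly(x, 2), data = sim)              # rescaled (orthogonal) terms
raw_fit  <- lm(y ~ poly(x, 2, raw = TRUE), data = sim)  # plain x and x^2
alt_fit  <- lm(y ~ x + I(x^2), data = sim)              # plain x and x^2, written out

# The printed coefficients differ between the first two fits...
orth_fit$coefficients
raw_fit$coefficients

# ...but the raw and written-out versions agree (should print TRUE)...
all.equal(unname(raw_fit$coefficients), unname(alt_fit$coefficients))

# ...and all versions give the same fitted curve (should print TRUE).
all.equal(unname(orth_fit$fitted.values), unname(raw_fit$fitted.values))
```

So the curve is the same either way, but only the raw or written-out version gives $\hat{\beta}$ values you can plug directly into a model written in terms of $x$ and $x^2$.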

Question 14

Using your model from Question 13, what is the residual for the 3rd row in the data set? Yes, you can use code to find this rather than computing it out!

Assessing Model Fit

Before we use this model to make predictions, we need to be able to describe the strength of our model to the client. This is also called describing how well our model fits the data. This is very important because if our model is a poor fit, any conclusions we draw from it will likely be unreliable.

As we learned in class, one way to assess the strength of the model (how well the model fits the data) is by computing the $R^2$.

\[R^2 = 1 -\frac{RSS}{TSS}\]

Question 15

Why is the RSS needed to compute the $R^2$? In other words, what does the RSS measure and why is it needed to compute the $R^2$?

To find the $R^2$ for our model, put the following code in a chunk and press play:

summary(fishmodel)

You will notice that we get a lot of output when we run this command, including the $R^2$, listed as Multiple R-squared.
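If you are curious where that number comes from, the sketch below computes $R^2 = 1 - \frac{RSS}{TSS}$ by hand and compares it to the value from summary. It uses R's built-in cars data and the made-up name example_model, not the fish model, so the numbers themselves are only an illustration.

```r
# Fit and store an LSLR model on R's built-in cars data (illustration only).
example_model <- lm(dist ~ speed, data = cars)

# RSS: add up the squared residuals.
rss <- sum(example_model$residuals^2)

# TSS: add up the squared distances between each Y and the mean of Y.
tss <- sum((cars$dist - mean(cars$dist))^2)

# R squared from the formula.
r2_by_hand <- 1 - rss / tss

# R squared as stored in the summary output.
r2_from_summary <- summary(example_model)$r.squared

# The two should match (should print TRUE).
all.equal(r2_by_hand, r2_from_summary)
```

The same steps work for any model stored with lm, as long as you swap in that model's name and the response column from its data set.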

Question 16

Your client asks you to tell them what the value of $R^2$ means and why knowing this is helpful for them. Provide a short reply.

Note: This counts as the first half of your 2nd Writing Check in this course. This means there is no specific answer I’m looking for here, this will just be graded on completion.

When we interpret things in statistics, there are generally two ways we do it. We can interpret formally, using specific notation and terminology. This is really important for communicating clearly with other statistical experts and data scientists, especially when we are building new methodology or deeply exploring a method.

However, when we talk to a client, we generally need to avoid terms that are potentially confusing or misleading to someone whose expertise is not in statistics or data science.

Look at your answer to Question 16. Do you see the terms “variance”, “variability”, or “X”, “Y”? Likely the answer is yes, because our formal definition does include these terms! However, we now need to refine our writing style to make sure we are able to explain clearly to the client.

This means that when we are talking to a client, memorizing definitions is not what is needed. Instead, we need to be able to explain concepts in a variety of ways to suit the client’s needs and background while connecting specifically to their data question. If an explanation we provide is not clear, we need to be ready to pivot and explain in a different way. Employers look for this skill in technical interviews for jobs and internships. This is something AI cannot do, and it is how we bring our skills to the table.

Question 17

Re-write your answer to Question 16 to avoid statistical terminology like “variance”, “variability”, etc., and make sure to specifically connect your answer to the client’s goals and data set.

Note: This counts as the second half of your 2nd Writing Check in this course. Your grade will just be based on whether you improved your answer to Question 16 to focus on the client.

Making Recommendations

With our model built, we are ready to fulfill our client’s request by making our recommendation on the value for the length minimum.

Question 18

Using your fitted model, what is the predicted fish length for a 4 year old bluegill? Show your work with appropriate notation.
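After you have shown your work by hand, one possible way to check the arithmetic is the predict function. The sketch below does this for an LSLR model on R's built-in cars data. The name example_model and the value speed = 10 are made up for illustration and have nothing to do with the fish model.

```r
# Fit and store an LSLR model on R's built-in cars data (illustration only).
example_model <- lm(dist ~ speed, data = cars)

# Prediction by hand: plug X = 10 into the fitted line.
b <- example_model$coefficients
by_hand <- b[1] + b[2] * 10

# Prediction using predict: the new X value goes in a data frame
# whose column name matches the X variable used in the model.
by_predict <- predict(example_model, newdata = data.frame(speed = 10))

# The two should match (should print TRUE).
all.equal(unname(by_hand), unname(by_predict))
```

Whatever model you use, the column name inside data.frame has to match the name of the X variable in your model.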

Question 19

Based on all of this, what length do you recommend our client use as the length minimum, meaning the length a bluegill must reach before it can be kept if caught? Explain why you chose this value.

Last step

Question 20

We have been in class together for a few weeks, and I would like your feedback on how the course is going so far. Please fill out this quick survey for me. Your answers are anonymous.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 September 16.