STA 112 Lab 3

Complete all Questions and submit your final PDF or HTML file (either works!) under Assignments in Canvas.

The Goal

Last lab, we focused on learning to fit an LSLR line (i.e., find the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)) using R. We looked specifically at coffee data, and went through the steps of building and interpreting the LSLR line.

It turns out that the choice to use the LSLR model involves some assumptions. We have already discussed the fact that we really shouldn’t use a linear model if the relationship between X and Y isn’t, well, linear. In other words, in a graph, the relationship between X and Y should look like a line. This is called the form (shape) condition, and it is one of the first things we check when we choose to use LSLR.

Today, we are going to explore alternative modeling choices when the shape of the relationship is not a line.

Remember that now is a good time to clear your Environment if you have not already! We created a lot of objects in the last lab, and today’s work will be easier if your Environment is clean.

The Data

The data we are working with today is on bluegills, a type of fish that lives in fresh water all throughout North America. We have a client who has provided us with our data on \(n = 170\) bluegill fish. The client wants to try and understand how the length of the fish changes with age. This information is important because bluegills are popular fish to catch for sport fishing.

Our client is in charge of determining how big a fish has to be in order for someone who catches a bluegill to keep the fish rather than having to put it back in the water. This required minimum length is called the length minimum. In other words, a fish that is longer than the length minimum can be kept by the person who caught it, but any fish with a length below the length minimum must be put back in the water.

The client knows that a bluegill reproduces the most during the first 4 years of life, so it is important that fish who are 4 years old or younger are not removed from the water. This helps ensure the species will continue to thrive in the wild.

You have been asked by the client (1) to determine what the length minimum should be and (2) to describe the relationship between X = fish age and Y = fish length.

Download the data

The data that you need can be downloaded from Canvas. Once you have downloaded the data, let’s move it into R!

  • Step 1 Look at the upper right hand panel of your RStudio screen (the Environment tab).
  • Step 2 Find “Import Dataset” or “Import” and click on it.
  • Step 3 Choose “Text (base)” or “From CSV” (it will depend on your computer).
  • Step 4 Find your data (bluegills.csv) in the list that comes up. Choose it!
  • Step 5 Now, look at the bottom right hand panel of your screen. You should see a line of code with something like bluegills <- read.csv("bluegills.csv") or bluegills <- read_csv("bluegills.csv").
  • Step 6 Copy that ENTIRE line of code.
  • Step 7 In your Markdown file, insert a code chunk.
  • Step 8 Paste the line from Step 5 into this gray code chunk, and press the green arrow (the play button).

And you are ready to go!! You now have the data loaded and you can work with it!
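If you want to confirm that the import worked, a quick check like the one below can help. This is only a sketch: it builds a tiny made-up CSV file (the name toy and its three rows are invented for illustration) so that the code runs on its own. With the real data you would run the same checks on bluegills.

```r
# Toy stand-in for a downloaded file: write a tiny CSV to a temporary location.
# The values here are made up for illustration; they are not the bluegill data.
toy_path <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(1, 2, 3), length = c(6.5, 10.2, 13.1)),
          toy_path, row.names = FALSE)

# read.csv() loads the file into a data frame, just like the pasted import line.
toy <- read.csv(toy_path)

nrow(toy)   # number of rows in the data set
names(toy)  # column names
head(toy)   # first few rows
```

For the bluegill data, nrow(bluegills) should match the \(n = 170\) fish described above.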

Note that this is a good time to try the echo=FALSE tip for saving space in your final document. At this point in the course, you all know how to load the data, and we don’t need that to show up in your knitted document. At the top of the R chunk where you have loaded the data, change the {r} command to {r, echo = FALSE}. Knit and make sure your document compiles.

EDA

The first step in starting any analysis is exploratory data analysis (EDA). We have to really take time to understand what is going on in the data. For this data set, we only have two columns, so this process will focus on those two variables. The length of the fish in centimeters is our response (Y) variable.

Question 1

Use one visualization (i.e., a plot you think is appropriate) to explore the relationship between the X = age of a fish (in years) and Y = the length (in centimeters). Make sure you have labeled your axes “Fish Age in years” and “Fish Length in cm”, and title your graph Figure 1.

Question 2

Describe the relationship between the two variables.

We have a specific goal when working with these data. How long do we expect fish to be if they are more than 4 years old? In other words, given the age of the fish (X), can we predict the length (Y)? We hope to then use this value to determine the length minimum.

Since we know we are interested in looking at a particular point on the x axis (Age = 4), it can be helpful to add a vertical line to our scatter plot at Age = 4. Since we are adding a vertical line, the code we will use is + geom_vline(xintercept = ) at the end of our scatter plot from Question 1. Right after the = sign in xintercept =, type the value on the x axis that you want the vertical line (vline) to go through.
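Here is a small sketch of how geom_vline fits onto the end of a scatter plot. It uses a made-up data set (toy, with invented columns hours and score) rather than the fish data, and it assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Made-up data: hours studied and quiz score for six students.
toy <- data.frame(hours = c(1, 2, 3, 4, 5, 6),
                  score = c(52, 60, 71, 75, 83, 88))

p <- ggplot(toy, aes(x = hours, y = score)) +
  geom_point() +
  geom_vline(xintercept = 3) +   # vertical line through hours = 3
  labs(x = "Hours Studied", y = "Quiz Score", title = "Toy Example")

p  # print the plot
```

The only part that changes for your own graph is the data, the variable names, the labels, and the number after xintercept =.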

Question 3

Create the same graph as you did in Question 1, but now add a vertical line to your graph at Age = 4. Make sure you have labeled your axes “Fish Age” and “Fish Length”, and title your graph Figure 2. Choose to either (1) change the color (col) of the line or (2) make the line dashed by adding lty = 2. For option 2, your code will look like geom_vline(xintercept = , lty = 2).

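Note: You can use numbers other than 2 to choose a different line type. You can also change the thickness of the line by adding size = 1.5 or some other value greater than 1. (In ggplot2 version 3.4.0 and later, the argument for line thickness is linewidth, and using size for lines gives a deprecation warning.)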

Question 4

Based strictly on the graph, what length might you choose as the length minimum? Explain your choice. Note: There are many “right” answers here; the point is to see your thinking!

Question 5

Create the same graph as you did in Question 3, but now add a horizontal line to your graph at the length minimum you have suggested. Make sure you have labeled your axes “Fish Age in years” and “Fish Length in cm”, and title your graph Figure 3.

Hint: We added a vertical line with geom_vline, where the v is for vertical. We then specified where on the x axis we wanted the line to hit with xintercept =. Now you want a horizontal line and a y intercept.

By looking at our graph, we can take a guess at the length minimum. However, to get a more exact answer, we want to build a statistical model for the relationship between fish age (X) and fish length (Y).

Question 6

Based on what we have so far, does it look like the shape of the relationship indicates that LSLR might be appropriate? Comment on why or why not.

Considering Other Options

Luckily, LSLR is not the only type of regression model we can build. There are a lot of other shapes we can use other than a line!

Question 7

Based on what we have so far, what might be a more appropriate regression model to use, rather than LSLR? Why?

In the question above, you have proposed a possible regression model choice. This choice involves a shape that is non-linear. When we are considering using a specific shape for the model, it is often helpful to graph the shape so that we can see if the choice we have made is reasonable for the data we have.

If we want to add the LSLR line to a scatter plot, we add + stat_smooth(formula = y ~ x, method = "lm", se = FALSE), as we discussed in the last lab.

If we want to add on a curve for something like polynomial regression, we instead add + stat_smooth(formula = y ~ poly(x, NUMBER), method = "lm", se = FALSE), where we replace NUMBER with the degree of the polynomial we want to use.
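As an illustration of the polynomial version, the sketch below adds a degree 3 curve to a scatter plot of made-up data (toy is invented for this example, and the degree 3 is chosen arbitrarily; it is not a suggestion for the fish data). It assumes ggplot2 is installed.

```r
library(ggplot2)

# Made-up data with a curved pattern.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                  y = c(1.0, 3.9, 5.1, 5.3, 5.2, 5.6, 7.0, 10.1))

p_curve <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # degree 3 polynomial curve, fit by least squares
  stat_smooth(formula = y ~ poly(x, 3), method = "lm", se = FALSE)

p_curve  # print the plot
```

Changing the 3 to another whole number changes how many bends the curve is allowed to have, so it is worth trying a couple of values and looking at the result.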

Question 8

Create the same graph as you did in Question 1, but add on the fitted model you have proposed in the previous question. Make sure to label your axes, and title your graph Figure 4. Does this shape seem (visually) like a reasonable choice? Hint: If you are stuck, look back at the slides from last class.

Okay, so now we have decided on the type of regression model we want to use! The next step is to build the model. This will allow us to predict how long we expect a fish to be at age 4 so we can answer our client’s question.

However, before we build the model, there is one important skill in R we need to practice. This skill will be incredibly helpful for using our output when we build models.

Storing in R

When we last built a regression model in R, we used code like this:

lm( y ~ x, data = dataset )

When we ran this code, the estimates of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) appeared. While this is very useful, it turns out that there is more information we can get from building a regression model in R. We can get things like the \(R^2\), the residuals, the leverage, and a lot of other things! However, getting access to all of this easily involves storing the results of our modeling.

In R, storing means that we assign a piece of output to a name. For example, suppose we want R to assign the number 4 to the name z. We can do that with this code:

z <- 4

In this code, we have given R an object (4) and we have told it to assign that object (<-) to the letter z. In other words, we have just told R that \(z = 4\). If you now create a chunk and type z, R will print out 4!
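The same idea works with any name and any number. A minimal sketch, using a made-up name w:

```r
w <- 10   # store the number 10 under the name w
w         # prints [1] 10
w - 3     # w can be used in calculations; prints [1] 7
```

Once a value is stored, R substitutes it anywhere the name appears.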

Question 9

If you were to type 4*z + 2 into a chunk, what output should you get when you run the chunk (when you press play)?

Question 10

Assign the number 6 to the name m. Show the output you get when you run 5*m + 4 in a chunk.

You can store more complicated objects than just numbers. Let’s try it out by building an LSLR model for fish age (X) and fish length (Y). Yes, I know the relationship is not really linear!! You will be adapting this code to reflect the actual model in just a minute.

To create an LSLR model, we run the code:

lm( length ~ age , data = bluegills)

To store that model output under the name LSLRmodel, we use the following:

LSLRmodel <- lm( length  ~ age , data = bluegills)

If you create a chunk, type in LSLRmodel and press play, you will see the estimates of \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Wait…didn’t we get that before?? So what have we gained?

Well, when you run lm in R, it does a lot more than just compute \(\hat{\beta}_0\) and \(\hat{\beta}_1\). It also stores the residuals, which we can retrieve using $:

LSLRmodel$residuals

So, we can now reach into our object (LSLRmodel) and retrieve ($) the residuals (residuals).

Question 11

What is the residual for the first row in the data set?

We can also obtain the values of \(\hat{Y}\) for every row in the data set, meaning we can find the y-coordinate of the LSLR line for every row in the data set, by using:

LSLRmodel$fitted.values
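Both of these return one number per row, so square brackets can be used to pull out a single row. The sketch below shows this on a made-up data set (toy and toy_model are invented names, not part of the lab data) and checks that a residual is the observed value minus the fitted value.

```r
# Made-up data for illustration only.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.0, 4.1, 5.9, 8.2, 9.8))
toy_model <- lm(y ~ x, data = toy)

toy_model$residuals[2]       # residual for the second row only
toy_model$fitted.values[2]   # y-hat for the second row only

# A residual is the observed y minus the fitted y-hat,
# so this matches toy_model$residuals[2]:
toy$y[2] - toy_model$fitted.values[2]
```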

Question 12

What is the value of \(\hat{y}_1\) for the first row in the data set?

Note: If you have the code LSLRmodel$residuals and LSLRmodel$fitted.values in chunks in your Markdown file, that means that right now your final lab will contain 340 numbers printed out. We do NOT want that! To hide long output, we put a # in front of the code. The # turns the line into a comment, so R skips it and nothing is printed. That means we don’t end up with long strings of numbers in our final document!

#LSLRmodel$residuals
#LSLRmodel$fitted.values

Creating the actual regression model

In the previous section, we built an LSLR model just as an example so we could see how to create and store objects. Now, let’s build the actual model that we think is appropriate.

Question 13

Look back at Question 8, when you chose a shape that you think is appropriate to represent the relationship between X and Y. This shape should correspond to one of the types of regression models we have learned in class.

Build your chosen model in R, and store it under the name fishmodel. As your answer to this question, write out the fitted model (that means including the numeric estimates for any \(\hat{\beta}\) values).

Hint: Using the code fishmodel$coefficients is one way to see the \(\hat{\beta}\) values.

Question 14

Using your model from Question 13, what is the residual for the 3rd row in the data set? Yes, you can use code to find this rather than computing it by hand!

Assessing Model Fit

Before we use this model to make predictions, we need to be able to describe to our client how well our model fits the data. This is very important because if our model is a poor fit, any conclusions we draw from it will likely be unreliable.

As we learned in class, one way to assess how well the model fits the data is by computing the \(R^2\).
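As a reminder, the standard definitions behind \(R^2\) are sketched below, using the same \(\hat{y}\) notation as earlier in the lab. The symbol \(\bar{y}\) (the sample mean of the response) is assumed here; if your class notes use slightly different notation, follow the notes.

```latex
\[
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}
\]
```

RSS measures the variation left over after fitting the model, TSS measures the total variation in Y, and \(R^2\) is the proportion of the total variation that the model accounts for.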

Question 15

Compute the RSS for your fitted model. Hint: See class notes for code.

Question 16

Compute the TSS for your fitted model. Hint: See class notes for code.

Question 17

Compute and interpret the \(R^2\) for your fitted model.

Great!! We can compute \(R^2\) by hand!! But is there a faster way?? Yes!

Put the following code in a chunk and press play:

summary(fishmodel)

You will notice that we get a lot of output when we run this command, including the \(R^2\)! It turns out that this format is the one we are going to be using going forward because the output contains information we will need to perform inference on our regression model.
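Just as we stored a model and reached into it with $, we can store the summary and pull out only the \(R^2\). A minimal sketch on made-up data (toy, toy_model, and toy_summary are invented names for illustration):

```r
# Made-up data for illustration only.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.0, 4.1, 5.9, 8.2, 9.8))
toy_model <- lm(y ~ x, data = toy)

toy_summary <- summary(toy_model)  # store the summary so we can reach into it
toy_summary$r.squared              # just the R^2 value
```

This can be a handy way to check an \(R^2\) that you computed by hand.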

Making Recommendations

With our model built, we are ready to fulfill our client’s request by making our recommendation on the value for the length minimum.

Question 18

Using your fitted model, what is the predicted fish length for a 4 year old bluegill? Show your work with appropriate notation.
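After you have written out the prediction by hand, one possible way to check your arithmetic is the predict function. The sketch below uses a made-up data set and a simple LSLR model (toy and toy_line are invented names, and x = 3 is an arbitrary value), so it shows the mechanics rather than the answer for the fish data.

```r
# Made-up data for illustration only.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.0, 4.1, 5.9, 8.2, 9.8))
toy_line <- lm(y ~ x, data = toy)

# By hand: plug x = 3 into the fitted equation beta0-hat + beta1-hat * x.
b <- toy_line$coefficients
by_hand <- b[1] + b[2] * 3

# With code: predict() needs a data frame whose column name matches the X variable.
by_code <- predict(toy_line, newdata = data.frame(x = 3))

by_hand
by_code   # should match by_hand
```

The column name inside newdata must be spelled exactly like the explanatory variable used when the model was built.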

Question 19

Based on all of this, what length do you recommend our client use as the minimum length a bluegill must reach before it may be kept if caught? Explain why you chose this value.

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 August 21.