Complete all questions, and submit your final document in PDF or HTML form on Canvas.

The Goal

Last lab, we focused on learning to fit an LSLR line (i.e., find the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)) using R. We looked specifically at coffee data, and went through the steps of building and interpreting an LSLR line.

It turns out that the choice to use an LSLR model involves some assumptions. We have already discussed the fact that we really shouldn't use an LSLR model if the relationship between Y and X isn't, well, linear. In other words, in a graph, the relationship between Y and X should look like a line. This is called the shape condition, and it is one of the first things we check when we choose an LSLR model.

Today, we are going to explore alternative modeling choices when the shape of the relationship is not a line.

Remember that now is a good time to clear your Environment if you have not already! We created a lot of objects in the last lab, and today's work will be easier if your Environment is clean.

The Data

The data we are working with today are on bluegills, a type of fish that lives in fresh water throughout North America. Our client, who provided the data, wants to understand how the length of the fish changes with age. This information is important because bluegills are popular fish to catch for sport fishing. Our client is in charge of determining how big a bluegill has to be for the person who catches it to keep the fish rather than putting it back in the water. This required minimum length is called the length minimum. The client knows that a bluegill reproduces the most during the first 4 years of life, so it is important that fish that are 4 years old or younger are not removed from the water. This helps ensure the species will continue to thrive in the wild.

You have been asked by the client (1) to determine what the length minimum should be and (2) to describe the relationship between fish age and fish length.

The data that you need can be downloaded from Canvas. Remember that the steps you need to load the data can be found in Lab 2. Ask if you get stuck!

Note that this is a good time to try the echo=FALSE tip for saving space in your final document. At this point in the course, you all know how to load the data, and we don't need that to show up in your knitted document. At the top of the R chunk where you have loaded the data, change the {r} command to {r, echo = FALSE}. Knit and make sure your document compiles.

EDA

The first step in starting any analysis is exploratory data analysis (EDA). We have to really take time to understand what is going on in the data. For this data set, we only have two columns, so this process will focus on those two variables. The length of the fish in centimeters is our response (Y) variable.

  1. Use one visualization (i.e., a plot you think is appropriate) to explore the relationship between the age of a fish (in years) and the length (in centimeters). Make sure you have labeled your axes "Fish Age" and "Fish Length", and title your graph Figure 1.
  1. Describe the relationship between the two variables. Remember to comment on all four things we learned about last lab.

We have a specific goal when working with these data. How long do we expect fish to be if they are more than 4 years old? We hope to then use this value to determine the length minimum.

Since we know we are interested in looking at a particular point on the x axis (Age = 4), it can be helpful to add a vertical line to our scatter plot at Age = 4. Since we are adding a vertical line, the code we will use is + geom_vline(xintercept = ) at the end of our scatter plot from Question 1. Right after the = sign in xintercept = , type the value on the x axis that you want the vertical line (vline) to go through.
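
As a rough sketch of the pattern described above, the chunk below adds a dashed vertical line to a scatter plot. The data frame toy_fish and its values are made up purely for illustration (they are not the client's data), and the line is placed at an arbitrary x value of 2, so you will still need to adapt the column names and the xintercept value to your own work.

```r
library(ggplot2)

# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  length = c(65, 72, 105, 112, 138, 147, 160, 171, 175, 178)
)

# Scatter plot plus a dashed vertical line at an arbitrary x value (2).
toy_plot <- ggplot(toy_fish, aes(x = age, y = length)) +
  geom_point() +
  geom_vline(xintercept = 2, lty = 2) +
  labs(x = "Fish Age", y = "Fish Length", title = "Toy example")

toy_plot
```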

  1. Create the same graph as you did in Question 1, but now add a vertical line to your graph at Age = 4. Make sure you have labeled your axes "Fish Age" and "Fish Length", and title your graph Figure 2. Choose to either (1) change the color (col) of the line or (2) make the line dashed by adding lty = 2. For option 2, your code will look like geom_vline(xintercept = , lty = 2). Note: You can use numbers other than 2 to choose a different line type. You can also change the thickness of the line by adding size = 1.5 or some other value greater than 1 (in ggplot2 version 3.4.0 and later, use linewidth = 1.5 instead of size).
  1. Based strictly on the graph, what length might you choose as the length minimum? Explain your choice.
  1. Create the same graph as you did in Question 3, but now add a horizontal line to your graph at the length minimum you have suggested. Make sure you have labeled your axes "Fish Age" and "Fish Length", and title your graph Figure 3. Hint: We added a vertical line with geom_vline, where the v is for vertical. We then specified where on the x axis we wanted the line to hit with xintercept = . Now you want a horizontal line and a y intercept.

By looking at our graph, we can take a guess at the length minimum. However, to get a more exact answer, we want to build a statistical model for the relationship between fish age (X) and fish length (Y).

  1. Based on what we have so far, does it seem like an LSLR model might be appropriate? Comment on why or why not.

Considering Other Options

Based on what we have so far, it does not look like we want to use LSLR. However, we do still want to use regression modeling if possible.

  1. Based on what we have so far, what might be a more appropriate form of the model to use, rather than a line? Why?

When we are considering using a specific shape for the model, it is often helpful to graph the shape so that we can see if the choice we have made is reasonable for the data we have.

  1. Create the same graph as you did in Question 1, but add on the fitted model you have proposed in the previous question. Make sure to label your axes, and title your graph Figure 4. Does this model seem (visually) like a reasonable choice? Hint: If you are stuck, look back at the slides from last class.
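
One possible way to overlay a candidate curve on a scatter plot is sketched below. It assumes, only for the sake of illustration, that the proposed shape is a quadratic; the data frame toy_fish is made up and is not the client's data. If your class slides use a different function or a different shape, follow the slides.

```r
library(ggplot2)

# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  length = c(65, 72, 105, 112, 138, 147, 160, 171, 175, 178)
)

# geom_smooth() with method = "lm" draws a least squares fit;
# the formula argument controls the shape (here, a degree-2 polynomial).
curve_plot <- ggplot(toy_fish, aes(x = age, y = length)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(x = "Fish Age", y = "Fish Length", title = "Toy example with fitted curve")

curve_plot
```

Changing the formula (for example, to y ~ x) lets you compare different shapes on the same plot.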

Visuals are wonderful, but now we need to write out our actual fitted model. This will allow us to predict how long we expect a fish to be at age 4 so we can answer our client's question.

  1. Fit your chosen model. Write out the fitted model (that means including the numeric estimates for any \(\hat{\beta}\) values).
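
A minimal sketch of fitting a curved model with lm() follows. It assumes a quadratic form purely as an example, and it uses a made-up data frame toy_fish rather than the client's data; the name toy_model is hypothetical.

```r
# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  length = c(65, 72, 105, 112, 138, 147, 160, 171, 175, 178)
)

# I() tells R to treat age^2 as arithmetic (age squared) inside the formula.
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

# The estimated coefficients, in the order: intercept, age, age squared.
coef(toy_model)
```

The three numbers returned by coef() are the estimates you would plug into the written form of a quadratic fitted model.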

Assessing Model Fit

Before we use this model to make predictions, we need to be able to describe to our client how well our model fits the data. This is very important because if our model is a poor fit, any conclusions we draw from it will likely be unreliable.

To assess model fit, we need to perform an outlier analysis and compute the \(R^2\) for our fitted model. Let's begin with the outlier analysis.

  1. Create a graph of the studentized residuals for your fitted model. Make sure to label your x axis "Fish Age" and your y axis "Studentized Residuals", and title your plot Figure 5. Hint: This is just a scatter plot where the y-axis is the studentized residuals. Look back at the notes from Class 6 if you need help with this.
  1. (a) Are there any outliers? (b) Looking at Figure 4, do you think any of the outliers are influential? Explain.
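
The sketch below shows one way to compute and plot studentized residuals. It uses rstudent() from base R, which returns externally studentized residuals; if your class notes use rstandard() or another function, use that instead. The data frame toy_fish and the quadratic toy_model are made up for illustration and are not the client's data or necessarily your chosen model.

```r
library(ggplot2)

# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  length = c(65, 72, 105, 112, 138, 147, 160, 171, 175, 178)
)
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

# Store one studentized residual per row of the data.
toy_fish$stud_resid <- rstudent(toy_model)

# Scatter plot of the residuals against age, with a reference line at 0.
resid_plot <- ggplot(toy_fish, aes(x = age, y = stud_resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lty = 2) +
  labs(x = "Fish Age", y = "Studentized Residuals", title = "Toy example")

resid_plot
```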

Now that we have completed the outlier analysis, we can work on determining how well the model fits the data.

  1. Compute the RSS for your fitted model.
  1. Compute the TSS for your fitted model.
  1. Compute and interpret the \(R^2\) for your fitted model.
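
As a hedged sketch of the three calculations above, the chunk below computes RSS, TSS, and \(R^2\) directly from their definitions and compares the result with the value reported by summary(). The data and the quadratic toy_model are made up for illustration only.

```r
# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  length = c(65, 72, 105, 112, 138, 147, 160, 171, 175, 178)
)
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

# RSS: sum of squared residuals (observed minus fitted).
rss <- sum(residuals(toy_model)^2)

# TSS: sum of squared deviations of the response from its mean.
tss <- sum((toy_fish$length - mean(toy_fish$length))^2)

# R^2 = 1 - RSS / TSS.
r_squared <- 1 - rss / tss

# Should agree with the value R reports for the same model.
all.equal(r_squared, summary(toy_model)$r.squared)
```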

Making Recommendations

We are now ready to make our recommendation on fish length.

  1. Using your fitted model, what is the predicted fish length for a 4 year old bluegill? Show your work with appropriate notation.
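
To check a prediction worked out by hand, one option is to compare it with predict(). The sketch below does this for a made-up quadratic toy_model at an arbitrary age of 2.5; the data, the model form, and the age are all illustrative assumptions, not the client's data or the answer to the question.

```r
# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  length = c(65, 72, 105, 112, 138, 147, 160, 171, 175, 178)
)
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

b  <- coef(toy_model)  # estimates: intercept, age, age squared
x0 <- 2.5              # arbitrary age chosen for the illustration

# Plug the age into the fitted equation by hand.
by_hand <- b[1] + b[2] * x0 + b[3] * x0^2

# Let R do the same calculation.
via_predict <- predict(toy_model, newdata = data.frame(age = x0))

all.equal(unname(by_hand), unname(via_predict))
```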

Now, one more step. This predicted value comes from a sample of data. We don't really expect all fish of 4 years of age to have exactly the same length! This means that in general, we prefer a range of values to a single prediction.

For any given value of an explanatory variable, we define a prediction interval as a range of plausible values for the response \(Y\). We will talk about the mathematics behind creating these soon, but for now, let's use R to help us create a prediction interval.

Let's suppose I want to make a 95% prediction interval for fish who are 3.5 years old, and suppose my model is called fishmodel. The code I would use is:

predict(fishmodel, newdata = data.frame(age = 3.5), interval = "prediction", level = 0.95)

The result that you get looks like this:


      fit       lwr      upr
1 146.2744 121.5592 170.9896

The first piece of the output, fit, is the predicted value of length when fish age is 3.5. The second piece of output, lwr, is the lower bound of the 95% prediction interval. The final piece of output, upr, is the upper bound of the 95% prediction interval.

This all means that we are 95% confident that a fish of age 3.5 years will be between 121.56 and 170.99 centimeters long.
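
To see how the level argument affects the interval, the sketch below builds two prediction intervals at the same age with different confidence levels. The data, the quadratic toy_model, the age of 2.5, and the levels 0.90 and 0.99 are all made up for illustration; they are not the client's data or the values requested in this lab.

```r
# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6),
  length = c(65, 72, 105, 112, 138, 147, 160, 171, 175, 178)
)
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

new_fish <- data.frame(age = 2.5)  # arbitrary age for the illustration

pi_90 <- predict(toy_model, newdata = new_fish,
                 interval = "prediction", level = 0.90)
pi_99 <- predict(toy_model, newdata = new_fish,
                 interval = "prediction", level = 0.99)

# Width of each interval: upper bound minus lower bound.
width_90 <- pi_90[, "upr"] - pi_90[, "lwr"]
width_99 <- pi_99[, "upr"] - pi_99[, "lwr"]

c(width_90 = unname(width_90), width_99 = unname(width_99))
```

The predicted value (fit) is the same in both cases; only the bounds move, with a higher confidence level giving a wider interval.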

  1. Our client requests a 98% prediction interval for fish length when the fish is 4 years old. Create and interpret (see the sentence above for an example of interpreting) a 98% prediction interval for fish length at 4 years.

With a prediction interval in hand, we can now make our final recommendation.

  1. Based on all of this, what length do you recommend our client use as the length minimum, that is, the length a bluegill must reach before it may be kept if caught? Explain why you chose this value.
Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2020 Sept 21.
The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.
The data set used in this lab is the nc data set, provided by OpenIntro.