Last lab, we focused on learning to fit a LSLR line (i.e., find the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)) using R. We looked specifically at coffee data, and went through the steps of building and interpreting a LSLR line.
It turns out that the choice to use a LSLR model involves some assumptions. We have already discussed the fact that we really shouldn't use a LSLR model if the relationship between Y and X isn't, well, linear. In other words, in a graph, the relationship between Y and X should look like a line. This is called the shape condition, and it is one of the first things we check when we choose a LSLR model.
Today, we are going to explore alternative modeling choices when the shape of the relationship is not a line.
Remember that now is a good time to clear your Environment if you have not already! We created a lot of objects in the last lab, and today's work will be easier if your Environment is clean.
The data we are working with today is on bluegills, a type of fish that lives in fresh water all throughout North America. We have a client who has provided us with our data. The client wants to try and understand how the length of the fish changes with age. This information is important because bluegills are popular fish to catch for sport fishing. Our client is in charge of determining how big a fish has to be in order for someone who catches a bluegill to keep the fish rather than putting it back in the water. This required minimum length is called the length minimum. The client knows that a bluegill reproduces the most during the first 4 years of life, so it is important that fish who are 4 years old or younger are not removed from the water. This helps ensure the species will continue to thrive in the wild.
You have been asked by the client (1) to determine what the length minimum should be and (2) to describe the relationship between fish age and fish length.
The data that you need can be downloaded from Canvas. Remember that the steps you need to load the data can be found in Lab 2. Ask if you get stuck!
Note that this is a good time to try the echo=FALSE tip for saving space in your final document. At this point in the course, you all know how to load the data, and we don't need that to show up in your knitted document. At the top of the R chunk where you have loaded the data, change the {r} command to {r, echo = FALSE}. Knit and make sure your document compiles.
The first step in starting any analysis is exploratory data analysis (EDA). We have to really take time to understand what is going on in the data. For this data set, we only have two columns, so this process will focus on those two variables. The length of the fish in centimeters is our response (Y) variable.
We have a specific goal when working with these data. How long do we expect fish to be if they are more than 4 years old? We hope to then use this value to determine the length minimum.
Since we know we are interested in a particular point on the x axis (Age = 4), it can be helpful to add a vertical line to our scatter plot at Age = 4. Since we are adding a vertical line, the code we will use is + geom_vline(xintercept = ) at the end of our scatter plot from Question 1. Right after the = sign in xintercept = , type the value on the x axis that you want the vertical line (vline) to go through.
You can also customize the line: either (1) change the color (col) of the line or (2) make the line dashed by adding lty = 2. For option 2, your code will look like geom_vline(xintercept = , lty = 2).
Note: You can use numbers other than 2 to choose a different line type. You can also change the thickness of the line by adding size = 1.5 or some other value greater than 1.
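As a minimal sketch of how these pieces fit together, the code below draws a scatter plot with a dashed, colored vertical line at Age = 4. The data frame toy_fish, its values, and the column name length are all made up for illustration; they are not the client's data, and your own column names may differ. The sketch assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6),
  length = c(65, 70, 98, 105, 130, 138, 150, 158, 165, 170, 172)
)

# Scatter plot of length against age, plus a vertical line at age 4.
# lty = 2 makes the line dashed; col sets its color.
toy_plot <- ggplot(toy_fish, aes(x = age, y = length)) +
  geom_point() +
  geom_vline(xintercept = 4, lty = 2, col = "blue")

toy_plot
```

Storing the plot in an object (here toy_plot) is optional, but it makes it easy to add more layers later without retyping the whole plot.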
Recall that to add a vertical line, we used geom_vline, where the v is for vertical. We then specified where on the x axis we wanted the line to hit with xintercept = . Now you want a horizontal line and a y intercept.
By looking at our graph, we can take a guess at the length minimum. However, to get a more exact answer, we want to build a statistical model for the relationship between fish age (X) and fish length (Y).
Based on what we have so far, it does not look like we want to use LSLR. However, we do still want to use regression modeling if possible.
When we are considering using a specific shape for the model, it is often helpful to graph the shape so that we can see if the choice we have made is reasonable for the data we have.
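One way to graph a candidate shape is to overlay a fitted curve on the scatter plot. The sketch below assumes, purely for illustration, that the candidate shape is a quadratic (second-order polynomial) in age; the shape you choose for the bluegill data may be different. The data frame toy_fish, its values, and the column name length are made up and are not the client's data. The sketch assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6),
  length = c(65, 70, 98, 105, 130, 138, 150, 158, 165, 170, 172)
)

# Overlay a least squares fit of the assumed quadratic shape.
# se = FALSE hides the shaded band so only the curve is drawn.
shape_plot <- ggplot(toy_fish, aes(x = age, y = length)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)

shape_plot
```

If the curve tracks the points closely across the whole range of ages, the shape is a reasonable candidate; if it misses systematically in some region, a different shape may be worth trying.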
Visuals are wonderful, but now we need to write out our actual fitted model. This will allow us to predict how long we expect a fish to be at age 4 so we can answer our client's question.
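As a sketch of how R can supply the numbers for a fitted model, suppose (as an assumption for illustration only) that the chosen shape is a quadratic, so the fitted model has the form \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 X^2\). The data frame toy_fish, its values, the column name length, and the model name toy_model are all made up; they are not the client's data or a required naming scheme.

```r
# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6),
  length = c(65, 70, 98, 105, 130, 138, 150, 158, 165, 170, 172)
)

# Fit the assumed quadratic model by least squares.
# I(age^2) tells R to treat age^2 as arithmetic, not formula syntax.
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

# The estimated coefficients, in the order
# intercept, age term, age-squared term.
toy_coefs <- coef(toy_model)
toy_coefs
```

The three values printed by coef() are the estimates that get plugged into the written form of the fitted model.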
Before we use this model to make predictions, we need to be able to describe to our client how well our model fits the data. This is very important because if our model is a poor fit, any conclusions we draw from it will likely be unreliable.
To assess model fit, we need to perform an outlier analysis and compute the \(R^2\) for our fitted model. Let's begin with the outlier analysis.
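This passage does not name a particular outlier method, so the following is only one common approach, sketched under assumptions: it flags points whose standardized residuals are larger than 2 in absolute value, which is a rule of thumb rather than a strict cutoff. The data frame toy_fish, its values, the column name length, and the quadratic model form are all made up for illustration and are not the client's data.

```r
# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6),
  length = c(65, 70, 98, 105, 130, 138, 150, 158, 165, 170, 172)
)

# Assumed model form (quadratic in age); swap in your own fitted model.
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

# Standardized residuals: each residual divided by its estimated
# standard deviation, so they are on a comparable scale.
std_resid <- rstandard(toy_model)

# Rows whose standardized residual exceeds the rule-of-thumb cutoff.
flagged <- toy_fish[abs(std_resid) > 2, ]
flagged
```

A flagged point is a candidate for a closer look, not an automatic deletion; whether it matters depends on how much it changes the fitted model.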
Now that we have completed the outlier analysis, we can work on determining how well the model fits the data.
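As a sketch of one way to obtain \(R^2\) in R, the value can be pulled out of the model summary. The data frame toy_fish, its values, the column name length, and the quadratic model form are assumptions made up for illustration; they are not the client's data.

```r
# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6),
  length = c(65, 70, 98, 105, 130, 138, 150, 158, 165, 170, 172)
)

# Assumed model form (quadratic in age); swap in your own fitted model.
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

# summary() stores R-squared in the element named r.squared.
r_squared <- summary(toy_model)$r.squared
r_squared
```

Multiplying r_squared by 100 gives the percent of the variability in the response that the model explains.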
We are now ready to make our recommendation on fish length.
Now, one more step. This predicted value comes from a sample of data. We don't really expect all fish of 4 years of age to have exactly the same length! This means that in general, we prefer a range of values to a single prediction.
For any given value of an explanatory variable, we define a prediction interval as a range of plausible values for the response \(Y\) at that value. We will talk about the mathematics behind creating these soon, but for now, let's use R to help us create a prediction interval.
Let's suppose I want to make a 95% prediction interval for fish who are 3.5 years old, and suppose my model is called fishmodel. The code I would use is:
predict(fishmodel, data.frame(age = 3.5), interval = "prediction", level = 0.95)
The result that you get looks like this:
       fit      lwr      upr
1 146.2744 121.5592 170.9896
The first piece of the output, fit, is the predicted value of length when fish age is 3.5. The second piece of output, lwr, is the lower bound of the 95% prediction interval. The final piece of output, upr, is the upper bound of the 95% prediction interval.
This means that we are 95% confident that a fish of age 3.5 years will be between 121.56 and 170.99 centimeters long.
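To try the whole workflow end to end without the client's data, here is a self-contained sketch. The data frame toy_fish, its values, the column name length, and the quadratic model form are all assumptions made up for illustration, so the numbers it produces will not match the output shown above.

```r
# Made-up values for illustration only; NOT the client's data.
toy_fish <- data.frame(
  age    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6),
  length = c(65, 70, 98, 105, 130, 138, 150, 158, 165, 170, 172)
)

# Assumed model form (quadratic in age); swap in your own fitted model.
toy_model <- lm(length ~ age + I(age^2), data = toy_fish)

# 95% prediction interval for a fish that is 3.5 years old.
# The column name in data.frame() must match the name used in the model.
toy_interval <- predict(toy_model, data.frame(age = 3.5),
                        interval = "prediction", level = 0.95)
toy_interval

# Pull out the individual pieces by column name.
toy_lower <- toy_interval[1, "lwr"]
toy_upper <- toy_interval[1, "upr"]
```

Extracting lwr and upr by name is handy when you want to report the bounds in a sentence or add them to a plot.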
We are now ready to make our recommendation on fish length.