This lab is a little different from other labs. The goal today is to get practice in the kind of work you will be doing in Project 3. This means building a multiple linear regression (MLR) model, where there are multiple possible predictors to consider. This means that the questions are more open ended than usual.
To load the data, put the following two lines of code in a chunk and press play:
library(palmerpenguins)
data("penguins")
penguins<-na.omit(penguins[,-2])
This data set contains information on n = 333 penguins from three different species. We have a client who is interested in building a model for Y = the body mass of a penguin in grams.
In addition to this response variable, we have information on:
species
- the type penguin.bill_length_mm
- the length of the penguin bill in millimeters.bill_depth_mm
- the depth of the penguin bill in millimeters.flipper_length_mm
- the flipper length of the penguin in millimeters.sex
- the biological sex of the penguin.year
- the year the penguin was measured.The client would like us to build a model that can be used to explain to other scientists what features are related to the body mass of penguins. Our client is not interested in the year variable, as they have prior research that suggests that penguin size has not changed during the study period. This means you should consider including all of the potential explanatory variables except for year in the model for the client.
Our very first step when we have a modeling task is to perform EDA. In this case, we have 5 possible explanatory variables and one response variable, so the first step is to look at the relationship between each explanatory variable and the response.
This process is going to involve making a lot of graphs. Because of this, let's take a second and talk about how to format our graphs for RMarkdown.
Suppose I want to make a histogram of body mass. The code I need for that is:
ggplot(penguins, aes(body_mass_g)) + geom_histogram(bins = 20, fill = "blue", col = "white")
If I put that in a chunk and press play, one histogram will appear on my screen. Let's suppose I want to show a box plot along with this histogram. I can create both graphs and have them both print out separately in my knit document. However, I can also tell the computer to print more than one graph at once to save space. Try the following:
g1 <- ggplot(penguins, aes(body_mass_g)) + geom_histogram(bins = 20, fill = "blue", col = "white")+ labs(title ="Figure 1")
g2 <- ggplot(penguins, aes(body_mass_g)) + geom_boxplot(fill = "white", col = "blue") + labs(title ="Figure 2")
gridExtra::grid.arrange(g1,g2, ncol = 2)
What you will notice is that we have stored each of the two graphs under a name. Our histogram is stored under g1
and our box plot is stored under g2
. Then, we use a special code called gridExtra::grid.arrange
to help us arrange the graphs in a grid. In our case, we have to graphs, and we want them side by side. This means we want the graphs in a 1 (row) by 2 (column) grid.
To create the 1 by 2 grid, we feed the computer our two graphs, and then tell it we want the figures in 2 columns (next to each other) by specifying ncol = 2
. In other words, the number of columns we want is 2!
Using these same two plots, change the code so that the graphs stack on top of each other, meaning they are displayed in a 2 (row) by 1 (column) grid. Show the graphs.
This technique of using a grid for the graphs makes our final product more professional, as well as saving space!
Another technique that is useful for creating graphs is changing the shape or color of points on a scatter plot to reflect different groups in the data. For instance, in the code below, I create two scatter plots for exploring the relationship between flipper length and body mass. In the first graph, all the dots are one color and one shape. In the second, the dots change shape and color depending on the sex of the penguin.
g1 <- ggplot(penguins, aes(flipper_length_mm,body_mass_g)) + geom_point()+ labs(title ="Figure 1", x="Flipper Length (mm)", y = "Body Mass (g)")
g2 <- ggplot(penguins, aes(flipper_length_mm,body_mass_g, col = sex, pch = sex)) + geom_point() + labs(title ="Figure 2", x="Flipper Length (mm)", y = "Body Mass (g)")
gridExtra::grid.arrange(g1,g2, ncol = 1)
In the code, col = sex
tells R to give each sex a different colored point on the graph, and pch = sex
tells R to give each sex a different shaped point on the graph.
What is the purpose of doing this? Well, it allows us to see if the relationship between Y and X differs for different categorical variables. Hint: We have seen in class recently how that might be used!
With these tips in mind, let's complete EDA for our data.
Conduct EDA to explore the relationship between our response variable and each potential predictor. (1) For each graph, describe the relationship between the two variables. (2) Based on what you see in the graphs, how will you consider including that variable in your MLR (multiple linear regression) model for the client? (3) Make sure you have used our new grid technique for your graphs! You don't have to stack them all together, but make sure that each graph is in a grid with at least one other graph.
The next thing we need to do is check for multicollinearity, which means relationships among our explanatory variables.
Between numeric variables, we check for multicollinearity in a few ways. Generally, we create plots to explore the relationships among the explanatory variables that we think might be related. If the relationship is linear, we can then use correlation to assess the strength of the relationship.
Conduct EDA to explore the relationship between our numeric explanatory variables (hint: there are 3). (1) For each graph, explain whether or not you think multicollinearity is a concern. (2) If the relationship is linear, use correlation to verify. If a correlation is above .6 or .7, we generally consider that multicollinearity could be a concern. (3) Make sure you have used our new grid technique for your graphs! You don't have to stack them all together, but make sure that each graph is in a grid with at least one another.
At this stage, we have done a very thorough job of EDA, and are ready to start building a model!
Based on what you found in EDA, write down (using appropriate notation) the population form of the model you will like to start with. Note: There is no one right answer here.
Fit your chosen model to the data and show the R output. What is the adjusted R-squared of your model?
Now, we have just created what is called a rough model. This is the starting place in model building. Our job now is to decide if we need to refine our model. Is there anything we want to try adding in? Removing? Changing? We can do this for a variety of reasons, including if our conditions for inference are not met.
We have learned some techniques in class for helping us compare models - this is the time to use them!
Perform any refining steps you would like, but you must consider comparing your rough model to at least one other potential model (so removing something, adding something, etc.). Show your work for choosing between the two models.
Write down your final fitted model. Check all necessary conditions for your model. Show your work.
At this stage, we have chosen our model. It is time to use it to help the client. Recall that the client asked us to build a model that can be used to explain to other scientists what features are related to the body mass of penguins, and how these features are related to body mass.
This lab has given us a first glimpse of model selection and model comparison, a topic you will be exploring further in Project 3.