Evolutionary biologists are keenly interested in the characteristics that enable a species to withstand the selective mechanisms of evolution. An interesting variable in this respect is brain size. One might expect that bigger brains are better, but certain penalties seem to be associated with large brains, such as the need for longer pregnancies and fewer offspring. Although the individual members of large-brained species may have a greater chance of surviving, the benefits for the species must be good enough to compensate for these potential penalties. To shed some light on this issue, it is helpful to investigate what characteristics are associated with large brains, after getting the effect of body size out of the way. Body size is not something we can control for (hold fixed) in the study design, but we can use multiple linear regression to “control” for it in the analysis (and this is the wonderful thing about MLR!).
To investigate this question, we will look at average values of brain weight (g), body weight (kg), gestation length (days), and litter size for 96 species of mammals (data from G.A. Sacher and E.R. Staffeldt (1974), American Naturalist, 108, pp. 593-613). The average brain size of a species is obviously related to its average body size, and we will focus on investigating gestation length. Therefore, we rephrase the question as: “Is there evidence that gestation length is associated with mean brain size, after accounting for body size?”
First, load the data and look at a summary of the different variables. Although these data are available in the Sleuth3 package, we will practice loading data from a .csv file to remember how to do it! The file is called GestationBrainWeight.csv and is found on D2L in the Lab1 content tab; we will assign it to the object brain_data. Adjust the following code chunk to read in these data from wherever you save the file on your computer. Use summary, head, and dim to gain insight into the data frame. Are there any interesting characteristics of the data frame?
The data frame does not give units of measurement for brain and body weight. It is unclear whether the two variables share the same units (which would be problematic, because brain weight appears to be larger than body weight) and whether the units are consistent across species. From Chapter 9, we know that brain weight is in grams while body weight is in kilograms (Ramsey and Schafer, 2013). Looking at the data frame, we can see that species with larger brain and body weights also tend to have longer gestation lengths (days) and smaller litter sizes, with some exceptions.
library(ggplot2)
library(GGally)
# change the path as necessary - the csv is in the same
# folder as this .Rmd on my computer, and the working directory
# in R is set to this location. To find your working directory
# uncomment and run:
# getwd()
# to change your working directory uncomment and run:
# setwd("PATH/TO/DESIRED/FOLDER")
# you can use tab to browse from the current working directory
# to the desired folder (or directory) on your computer!
brain_data <- read.csv("GestationBrainWeight.csv",
                       header = TRUE)
dim(brain_data)
## [1] 96 5
head(brain_data)
## Species Brain Body Gestation Litter
## 1 Aardvark 9.6 2.20 31 5.0
## 2 Acouchis 9.9 0.78 98 1.2
## 3 African elephant 4480.0 2800.00 655 1.0
## 4 Agoutis 20.3 2.80 104 1.3
## 5 Axis deer 219.0 89.00 218 1.0
## 6 Badger 53.0 6.00 60 2.2
summary(brain_data)
## Species Brain Body Gestation
## Length:96 Min. : 0.45 Min. : 0.017 Min. : 16.0
## Class :character 1st Qu.: 12.60 1st Qu.: 2.075 1st Qu.: 63.0
## Mode :character Median : 74.00 Median : 8.900 Median :133.5
## Mean : 218.98 Mean : 108.328 Mean :151.3
## 3rd Qu.: 260.00 3rd Qu.: 94.750 3rd Qu.:226.2
## Max. :4480.00 Max. :2800.000 Max. :655.0
## Litter
## Min. :1.00
## 1st Qu.:1.00
## Median :1.20
## Mean :2.31
## 3rd Qu.:3.20
## Max. :8.00
# Brain = g, Body = kg, Gestation = days, Litter = litter size
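As a quick sanity check on the units, the sketch below re-enters the six rows shown in the head() output and converts brain weight from grams to kilograms, following the units reported in Chapter 9. The object first_six and the columns brain_kg and brain_frac are hypothetical names introduced only for this illustration.

```r
# Six rows copied from the head(brain_data) output above
first_six <- data.frame(
  Species = c("Aardvark", "Acouchis", "African elephant",
              "Agoutis", "Axis deer", "Badger"),
  Brain = c(9.6, 9.9, 4480, 20.3, 219, 53),  # grams
  Body  = c(2.20, 0.78, 2800, 2.80, 89, 6)   # kilograms
)

# Put both weights on the same scale (kg); brain_kg is a hypothetical column
first_six$brain_kg <- first_six$Brain / 1000

# Brain weight as a fraction of body weight
first_six$brain_frac <- first_six$brain_kg / first_six$Body

first_six[, c("Species", "brain_kg", "Body", "brain_frac")]
```

Once both weights are in kilograms, every brain weight in these rows is a small fraction of the corresponding body weight, which resolves the apparent puzzle of brains outweighing bodies.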
Create a matrix of scatterplots for the four variables: brain weight, body weight, gestation length, litter size.
# look at colnames of brain data to see which cols
# you need to use in the pairs plot
names(brain_data)
## [1] "Species" "Brain" "Body" "Gestation" "Litter"
# use the 2nd through 5th column to get the desired pairs plot
ggpairs(brain_data[,2:5],
upper = list(continuous = "points"),
lower = list(continuous = "blank"))
# don't display redundant info by letting lower panel be blank
It looks as though a natural log transformation would help limit the pull of potential outliers and even out the spread. The scatterplots show several traits that signal the need for a log transformation, such as fanning patterns, right skewness, and greater spread at higher averages.
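To see why a log transformation helps with right skewness, here is a minimal sketch on simulated data (not the lab data). The helper skewness() is written here for illustration and is not from any package; the lognormal parameters are arbitrary assumptions.

```r
set.seed(1)

# Simulated right-skewed "weights" (arbitrary lognormal parameters)
x <- rlnorm(500, meanlog = 2, sdlog = 1.5)

# Moment-based sample skewness (hypothetical helper for this sketch)
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw <- skewness(x)       # strongly positive for right-skewed data
skew_log <- skewness(log(x))  # near zero after the log transformation

c(raw = skew_raw, logged = skew_log)
```

The same helper could be applied to brain_data$Brain and log(brain_data$Brain) to check whether the lab data behave similarly.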
Use ggpairs to create a matrix of scatterplots for the four variables, with three of them on the natural logarithm scale (ln_brain, ln_body, ln_gest, and Litter). You will need to first create new variables in the data frame called ln_brain, ln_body, and ln_gest, reflecting the log transformation. Then, use your plot to assess the following questions.
# create the log-transformed variables
brain_data$ln_brain <- log(brain_data$Brain)
brain_data$ln_body <- log(brain_data$Body)
brain_data$ln_gest <- log(brain_data$Gestation)
# plot the three log-scale variables together with Litter
ggpairs(brain_data[, c("ln_brain", "ln_body", "ln_gest", "Litter")],
        upper = list(continuous = "points"),
        lower = list(continuous = "blank"))
On the log scale, the relationships between ln_body and ln_brain, ln_gest and ln_brain, and ln_gest and ln_body look more linear and show more even spread than on the original scale. The marginal distributions also look more symmetric and less right-skewed. All three relationships are positive and approximately linear: species with larger brain weights tend to have larger body weights and longer gestation lengths, and species with larger body weights tend to have longer gestation lengths. The only variable without a clear linear relationship with the others is litter size, which suggests that litter size may add little to our final model.
There do not appear to be any strong outliers. One point in the top right corner of the plot may be an outlier, but it does not appear to be influential, since the overall trend would stay about the same with or without that data point.
For the rest of this lab we will use the log-transformed variables but I will refer to them by their original names for simplicity.
Now, let’s think about what “controlling for” or “accounting for” body size really means in the context of a linear model. In the review notes we have looked at the easier situation, which is accounting for a categorical explanatory variable. It is natural to think of holding a categorical explanatory variable fixed at one of its values because it is just thinking about the relationship within each level (or group). That is, we think about varying the continuous explanatory variable while holding group constant — this results in changes in the intercept and slope (if an interaction is included) when moving among the groups.
When we have two continuous explanatory variables, the idea of holding one explanatory variable fixed while varying another gets a little trickier to think about. This is what is being quantified by our regression coefficients. It is helpful to think of artificially categorizing the variable we are trying to hold constant to define “sub-groups” or “sub-populations.” For example, we can translate the question of interest to: “Is there evidence that gestation length is associated with mean brain size within subgroups of species with similar body size?”
The variable used to define “subpopulations” in the question is body size. By grouping species of similar body size, we treat body size as if it were categorical rather than continuous, which holds it approximately fixed within each group. Within each group, we can then see the relationship between gestation length and mean brain size more clearly. We can also see how each sub-group contributes to the overall trend, and the influence each sub-group may have on it.
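The sub-group idea can be sketched numerically. The example below uses simulated data (not the lab data) with made-up coefficients; the names sim, body_bin, and bin_slopes are hypothetical. It bins the variable being held fixed, fits a line within each bin, and compares those slopes to the SLR and MLR slopes.

```r
set.seed(42)
n <- 300

# Simulated log-scale variables; all coefficients here are assumptions
ln_body  <- rnorm(n, mean = 2, sd = 2)
ln_gest  <- 4 + 0.25 * ln_body + rnorm(n, sd = 0.4)
ln_brain <- -0.5 + 0.7 * ln_gest + 0.55 * ln_body + rnorm(n, sd = 0.3)
sim <- data.frame(ln_brain, ln_gest, ln_body)

# Artificially categorize body size into 6 equal-width bins
sim$body_bin <- cut(sim$ln_body, breaks = 6)

# Slope of ln_brain on ln_gest within each bin (NA if the bin is too small)
bin_slopes <- sapply(split(sim, sim$body_bin), function(d) {
  if (nrow(d) < 5) return(NA_real_)
  coef(lm(ln_brain ~ ln_gest, data = d))[["ln_gest"]]
})

# Slopes ignoring body size (SLR) and accounting for it (MLR)
slr_slope <- coef(lm(ln_brain ~ ln_gest, data = sim))[["ln_gest"]]
mlr_slope <- coef(lm(ln_brain ~ ln_gest + ln_body, data = sim))[["ln_gest"]]

bin_slopes
c(slr = slr_slope, mlr = mlr_slope)
```

Because body size still varies a little inside each bin, the within-bin slopes only approximate the MLR coefficient. Narrower bins hold body size more nearly fixed but leave fewer points per bin.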
Below are theoretical linear models for a simple linear regression investigating the relationship between brain weight and gestation length, and a population model for investigating the same relationship while accounting for body size. Explain what the coefficient for gestation length means in a simple linear regression model. Compare that to what it means in a model also including body weight as an explanatory variable. Why do you think I included the “extra” subscripts on the partial regression coefficients?
\[\begin{align*} SLR:\:\: brain_i &\overset{iid}{\sim} N(\mu_g, \sigma_g^2)\\ \mu_g\{brain|gest\} &= \beta_{0,g}+ \beta_{1,g}ln\_gest\\ Var\{brain|gest\} &= \sigma_{g}^2\\ MLR: \:\: brain_i &\overset{iid}{\sim} N(\mu_{gb}, \sigma_{gb}^2)\\ \mu_{gb}\{brain|gest, body\} &= \beta_{0,gb} + \beta_{1,gb}ln\_gest + \beta_{2,gb}ln\_body\\ Var\{brain|gest, body\} &= \sigma_{gb}^2\\ \end{align*}\]
In the simple linear regression model, the coefficient on gestation length, \(\beta_{1,g}\), is the change in mean brain weight associated with a one-unit increase in gestation length, ignoring body weight (all on the log scale). In the MLR model, the coefficient \(\beta_{1,gb}\) is the change in mean brain weight associated with a one-unit increase in gestation length among species with the same body weight. The extra subscripts on the partial regression coefficients show which explanatory variables are in the model: the meaning (and value) of a coefficient depends on what else is being accounted for, so \(\beta_{1,g}\) and \(\beta_{1,gb}\) are different parameters.
Let’s look at plots of the data to help understand these differences and to further investigate what “accounting for” really means. First, we will just look at the scatterplot of log of brain weight and log of gestation length, as well as the output from the associated simple linear regression model. (NOTE: the plot includes “bands” constructed by connecting the ends of pointwise confidence intervals).
ggplot(brain_data, aes(y = ln_brain, x = ln_gest)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE,
colour = "green") +
geom_smooth(method = "lm")
options(show.signif.stars = FALSE)
lm_gest <- lm(ln_brain ~ ln_gest,
data = brain_data)
summary(lm_gest)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.664758 0.5616447 -11.86650 2.164129e-20
## ln_gest 2.233992 0.1172203 19.05805 4.727507e-34
summary(lm_gest)$sigma
## [1] 0.9896692
So, how much of the overall relationship between gestation length and mean brain weight (plot and model in part a) is simply because brain weight is clearly related to body weight, which is also related to gestation length (refer back to the pairs plot)? While there is nothing wrong with looking at the above relationship, it is not what the researchers were originally interested in. They wanted to investigate gestation length while “controlling for” or “fixing” body weight. In other words, they wanted to get the body weight relationship out of the way before assessing the relationship with gestation length.
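One way to make “getting the body weight relationship out of the way” concrete is to remove body weight from both brain weight and gestation length and then relate the leftovers. For least-squares fits, the slope from regressing one set of residuals on the other equals the MLR coefficient (a standard result, often called the Frisch-Waugh-Lovell theorem, which this lab does not name). The sketch below checks this on simulated data with made-up coefficients; sim, brain_resid, and gest_resid are hypothetical names.

```r
set.seed(7)
n <- 200

# Simulated log-scale variables; all coefficients here are assumptions
ln_body  <- rnorm(n, mean = 2, sd = 2)
ln_gest  <- 4 + 0.25 * ln_body + rnorm(n, sd = 0.4)
ln_brain <- -0.5 + 0.7 * ln_gest + 0.55 * ln_body + rnorm(n, sd = 0.3)
sim <- data.frame(ln_brain, ln_gest, ln_body)

# Step 1: remove the body weight relationship from each variable
brain_resid <- resid(lm(ln_brain ~ ln_body, data = sim))
gest_resid  <- resid(lm(ln_gest ~ ln_body, data = sim))

# Step 2: relate what is left of brain weight to what is left of gestation
partial_slope <- coef(lm(brain_resid ~ gest_resid))[["gest_resid"]]

# The MLR coefficient on ln_gest from a single fit
mlr_slope <- coef(lm(ln_brain ~ ln_gest + ln_body, data = sim))[["ln_gest"]]

all.equal(partial_slope, mlr_slope)
```

The same two-step check could be run on brain_data once the log-transformed columns exist, replacing sim with brain_data.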
We don’t have multiple observations at each explanatory variable value (as we would if body weight was a categorical explanatory variable). Therefore, it’s not as clear where the information is coming from to estimate the relationship between mean brain weight and gestation length for species with the same body weights. Instead of the “same” body weights, we will consider mammals with similar body weights. To visualize and illustrate this, we will artificially categorize body weight so that we have many groups of mammals with similar body sizes that we can plot as a “group” and think about the slope within each group.
We will use lattice/panel plots and/or color/symbol coding by different body size categories to make a plot to illustrate the “accounting for” in this context. That is, we plot the data after “cutting” up the body weight variable so that we can look at the relationship between brain size and gestation length within each body weight category (or bin). The number and width of the bins are arbitrary, and the sensitivity of the results to this choice should be investigated. We fit a simple linear regression model within each body weight category and include pointwise confidence intervals just to show the approximate uncertainty within each group.
Group Discussion: The researchers are interested in the relationship between gestation length and brain size while controlling for body weight. One reason to control for body weight is that body weight and gestation length, the two explanatory variables, are strongly and positively correlated, and both have a positive relationship with brain size (see the raw data plots). Because the explanatory variables are correlated (multicollinearity), it is hard to tell how much of the relationship with brain size is attributable to gestation length rather than body weight. We also do not have multiple species sharing exactly the same body weight; each point represents a single species with its own body weight and gestation length. The color-coded plot therefore bins body weight into subgroups so that we can visualize the relationship between brain size and gestation length within body weight categories. Specifically, each color represents a bin of similarly sized mammals, each point represents a species, and each colored line is the least-squares fit for that bin. This visualization lets us see how the fitted lines differ among bins and how they compare to the single line fit to all species together.
brain_data$cut_ln_body <- cut(brain_data$ln_body,
breaks = 6) # create 6 subgroups
ggplot(brain_data, aes(y = ln_brain, x = ln_gest)) +
geom_point(aes(colour = cut_ln_body)) +
xlab("ln(gestation length) (log-days)") +
ylab("ln(brain weight) (log-gm)") +
geom_smooth(method = "lm",
aes(group = cut_ln_body,
colour = cut_ln_body)) +
geom_smooth(method = "lm", colour = 1, se = FALSE) +
labs(colour = "log(Body Weight) (log kg)")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# another way to visualize, different panels for
# each body weight group
# relevel the factor to make the plot comparable to the
# cut and coded scatter above...
brain_data$cut_ln_body <- factor(brain_data$cut_ln_body,
levels = levels(brain_data$cut_ln_body)[6:1])
ggplot(brain_data, aes(y = ln_brain, x = ln_gest)) +
geom_point() +
xlab("ln(gestation length) (log-days)") +
ylab("ln(brain weight) (log-gm)") +
facet_wrap(~ cut_ln_body, ncol = 1)
What do you notice from the above plots when you compare to the raw data pairs plot and simple linear regression that did not include body weight?
The plots of the log-transformed data separated by body weight class are much easier to interpret, and it is easier to see individual data points and how they relate to one another. The SLR that did not include body weight had fairly wide spread about the fitted line, with many points falling outside the confidence band. When body weight is treated as a class, fitting a separate line within each weight class gives less variation about each line and fewer points falling outside the bands. In the scatterplot of ln(brain weight) against ln(gestation length) color-coded by body weight, accounting for body weight flattens most of the within-group relationships, especially for the groups in the middle. The range of values within each body weight group also tends to shrink as body weight increases, and the largest body weight group shows the most clearly linear relationship. Finally, brain weight (g) and body weight (kg) are strongly correlated.
The relationship does not look the same for all body size categories. The smallest and largest body weight groups appear to have lower variability and more clearly linear relationships, while the groups in the middle of the range have greater variability, and a majority of the points fall in the middle to upper weight classes. There also appears to be a strong outlier in the largest body size group. Knowing that some body size groups show stronger linear relationships with brain size helps identify which species or body size groups have the greatest influence on the overall trend seen in the SLR. In other words, it makes sense to include the log of body weight in the MLR, since the relationship with brain size appears to differ among body size groups.
Now, let’s actually look at the regression output from the model with body weight, and compare it to the simple linear regression output from earlier. Interpret the coefficient associated with ln_gest in both models (lm_gest and lm_gest_body). How does the interpretation change? Does the change make sense based on the figures you made in part 5?
lm_gest_body <- lm(ln_brain ~ ln_gest + ln_body,
data = brain_data)
summary(lm_gest_body)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4572826 0.45848472 -0.997378 3.211691e-01
## ln_gest 0.6678215 0.10874659 6.141080 2.002004e-08
## ln_body 0.5511654 0.03235852 17.033086 2.423957e-30
summary(lm_gest_body)$sigma
## [1] 0.4902111
summary(lm_gest)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.664758 0.5616447 -11.86650 2.164129e-20
## ln_gest 2.233992 0.1172203 19.05805 4.727507e-34
summary(lm_gest)$sigma
## [1] 0.9896692
par(mfrow = c(2,2))
plot(lm_gest_body, col = brain_data$cut_ln_body)
Interpret the coefficient associated with ln_gest in both models (lm_gest: 2.233992; lm_gest_body: 0.6678215). How does the interpretation change? Does the change make sense based on the figures you made in part 5?
In the SLR model (lm_gest), the coefficient on ln_gest (2.2340) is the estimated change in mean ln(brain weight) for a one-unit increase in ln(gestation length), ignoring body weight. In the MLR model (lm_gest_body), the coefficient (0.6678) is the estimated change in mean ln(brain weight) for a one-unit increase in ln(gestation length) within subpopulations of species with the same ln(body weight). The intercepts (-6.6647 and -0.4573) are the estimated means of ln(brain weight) when the explanatory variables equal 0 on the log scale; they are on the log-gram scale, not in kilograms. The MLR estimate can be thought of roughly as a common slope across the 6 body weight subgroups, whereas the SLR estimate also absorbs the relationship between body weight and brain weight. The smaller coefficient after accounting for body weight matches the subgroup plot, where the within-group lines are flatter than the overall line. The residual standard error also drops from 0.990 to 0.490, so there is less unexplained variability once body weight is in the model. This seems reasonable, since body weight is strongly related to both brain weight and gestation length.
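The two estimates can also be compared with a little arithmetic on the reported output. The sketch below makes two assumptions that the lab does not state. First, because both variables are logged, a doubling of gestation length is read as a multiplicative change of 2 raised to the coefficient in median brain weight, which relies on roughly symmetric distributions on the log scale. Second, it uses the least-squares identity that the SLR slope equals the MLR slope plus the ln_body coefficient times the slope from regressing ln_body on ln_gest. That last slope is not reported in the lab; here it is only backed out from the other three numbers.

```r
# Coefficients copied from the regression output above
b1_slr <- 2.233992   # ln_gest in lm_gest
b1_mlr <- 0.6678215  # ln_gest in lm_gest_body
b2_mlr <- 0.5511654  # ln_body in lm_gest_body

# Multiplicative change in median brain weight when gestation doubles
double_slr <- 2^b1_slr  # ignoring body weight
double_mlr <- 2^b1_mlr  # among species with the same body weight

# Implied slope of ln_body on ln_gest, from
#   b1_slr = b1_mlr + b2_mlr * (slope of ln_body on ln_gest)
implied_body_on_gest <- (b1_slr - b1_mlr) / b2_mlr

round(c(double_slr = double_slr,
        double_mlr = double_mlr,
        implied_body_on_gest = implied_body_on_gest), 2)
```

Under these assumptions, the gap between the two gestation coefficients is the part of the SLR slope that travels through body weight. It could be checked against coef(lm(ln_body ~ ln_gest, data = brain_data)) on the actual data.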