Starting off

Remember to set your working directory before you start. (See Week 1’s tutorial if you forget how.) You also need to load today’s packages.
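
If you need a reminder, it looks something like this (a sketch; the path below is a placeholder, so point it at your own course folder):

setwd("C:/Users/yourname/statistics")  # placeholder path: replace with your own folder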

library(car)         # Anova() and S()
library(effectsize)  # omega_squared()
library(emmeans)     # emmeans(): estimated marginal means
library(ggplot2)     # plotting
library(ggthemes)    # theme_few() and colorblind-friendly scales
library(ggExtra)     # ggMarginal(): marginal plots
library(lmodel2)     # model II regression

Correlation

Our example dataset is stored as neuroticism.csv. This dataset (taken from Mobbs et al., 2005) contains scores of extroversion and neuroticism in 17 volunteers. We are interested in knowing whether neuroticism and extroversion are related to one another. In this case, both variables are random and there is not really an obvious “predictor” or “response” variable, so a correlation test is appropriate.

Correlations can be calculated in R using the cor() function, and their significance can be tested using the cor.test() function. We would like to calculate Pearson's product-moment correlation coefficient (\(r\)) if possible, because it is the most commonly used and most easily interpreted correlation coefficient, and the test for it is the most powerful when its assumptions are met. To test it safely, we must check that the data are bivariate normal and that the true relationship appears to be linear.
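
For reference, the two calls look like this on simulated data (a minimal sketch; x and y here are made up and are not our dataset):

set.seed(1)
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)
cor(x, y)                          # the correlation coefficient only
cor.test(x, y, method = "pearson") # coefficient plus significance test and 95% CI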

We will look at our data with a scatterplot. In addition, we will add "rug plots" and "violin plots" along the x- and y-axes so we can view the marginal distributions of extroversion and neuroticism. If the assumptions for testing Pearson's correlation coefficient are met, both marginal distributions should look approximately normal, and the scatterplot should be roughly elliptical.

d <- read.csv("Data/neuroticism.csv")

# Note: to overlay a 95% data ellipse for (extroversion, neuroticism)
# on this plot, we could add the layer:
# stat_ellipse(alpha = 0.3)
myPlot <- ggplot(data = d, aes(x = extroversion, y = neuroticism)) + 
  geom_point() +
  geom_rug(sides = "tr", color = "grey") + # "tr" = rugs on top and right
  labs(x = "Extroversion", y = "Neuroticism") +
  theme_few()
ggMarginal(myPlot, type = "violin")

With so few data points, it is difficult to determine if the data are truly bivariate normal, but the slightly bimodal nature of extroversion score is a little concerning. It may be more appropriate to calculate Spearman's rank correlation coefficient (\(r_s\)) rather than Pearson's correlation coefficient in this case. \(r_s\) does not indicate the straight-line relationship between variables; it only indicates how they tend to increase or decrease monotonically with respect to one another.

cor.test(d$extroversion, d$neuroticism, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  d$extroversion and d$neuroticism
## S = 1247.7, p-value = 0.02897
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5290943

Extroversion and neuroticism are significantly negatively correlated (\(r_s=-0.53, p=0.029\)).
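
Incidentally, \(r_s\) is just Pearson's \(r\) computed on the ranks of the data, which you can verify directly:

cor(rank(d$extroversion), rank(d$neuroticism))  # identical to cor(..., method = "spearman")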

Simple linear regression

x is fixed1

Tribolium are flour beetles. Nine batches of Tribolium were weighed at the beginning of a trial, kept in different relative humidities, and weighed again after six days of starvation. Greater weight loss was expected at lower humidities due to water loss. This dataset is stored as tribolium.csv and was originally published by Nelson (1964).

d <- read.csv("Data/tribolium.csv")
ggplot(data = d, aes(x = Humidity, y = WeightLoss)) + 
  geom_point() +
  labs(x = "Percent relative humidity", y = "Weight loss (mg)") +
  theme_few()

It looks as if a straight line may fit these data very well. Let's conduct an ordinary least squares (OLS) regression and plot the residuals.

tribolium.lm <- lm(WeightLoss ~ Humidity, data = d)
opar <- par(mfrow = c(2, 2))
plot(tribolium.lm)
par(opar)

The residuals don’t look too horrible. With small sample sizes, some deviation is not unusual, and it is difficult to tell if the assumptions are really violated or not from these plots. (Next week, we will use some better methods for examining model assumptions. I have used them on this dataset and confirmed that it does not appear to violate any assumptions.)

We can get an ANOVA table for our regression using the anova() function, and a coefficient table using summary(). anova() uses Type I sums of squares, but that is OK for a simple linear regression. We can also get the coefficients of the model and their 95% confidence intervals using coefficients() and confint():

anova(tribolium.lm)
## Analysis of Variance Table
## 
## Response: WeightLoss
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Humidity   1 23.5145  23.515  267.18 7.816e-07 ***
## Residuals  7  0.6161   0.088                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(tribolium.lm)
## 
## Call:
## lm(formula = WeightLoss ~ Humidity, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46397 -0.03437  0.01675  0.07464  0.45236 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.704027   0.191565   45.44 6.54e-10 ***
## Humidity    -0.053222   0.003256  -16.35 7.82e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2967 on 7 degrees of freedom
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.9708 
## F-statistic: 267.2 on 1 and 7 DF,  p-value: 7.816e-07
coefficients(tribolium.lm)
## (Intercept)    Humidity 
##  8.70402730 -0.05322215
confint(tribolium.lm)
##                   2.5 %      97.5 %
## (Intercept)  8.25104923  9.15700538
## Humidity    -0.06092143 -0.04552287

Higher humidity levels are associated with less weight loss in Tribolium (\(b=-0.05,R^2=0.97,F_{1,7}=267.2,p<0.001\)).
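
As an aside, the fitted model can also be used for prediction with predict(). For example, the predicted weight loss at relative humidities of 30% and 70% (values chosen purely for illustration), with 95% confidence intervals:

newdat <- data.frame(Humidity = c(30, 70))
predict(tribolium.lm, newdata = newdat, interval = "confidence")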

Now let’s re-plot our data with the regression line. geom_smooth(method = "lm") adds a simple linear regression line calculated using lm(), the same as we did ourselves above.

ggplot(data = d, aes(x = Humidity, y = WeightLoss)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Percent relative humidity", y = "Weight loss (mg)") +
  scale_color_colorblind() +
  theme_few()
## `geom_smooth()` using formula = 'y ~ x'

x is random

If x is a random variable, then we are in the situation of what is annoyingly called a “model II regression”. If your primary goal is to predict y from x, then you can continue to use lm() to conduct OLS regression as above. However, if you are primarily interested in determining the functional relationship between y and x (i.e., the slope), then a different regression method is often suggested.

The R package lmodel2 is designed for simple model II regression. It contains functions for OLS regression, as well as major axis (MA) regression, standard major axis (SMA) regression, and ranged major axis (RMA) regression.2 The package vignette contains good recommendations about when to use each type of regression; to view it, type: vignette("mod2user", package = "lmodel2")

The dataset we will use is fisheggs.csv (originally from Sokal & Rohlf, 1995). The dataset contains mass measurements (x) for 11 female fish and the number of eggs they later produced (y). We want to estimate the slope of this relationship. Note that:

  • x is random, so this is a model II regression.
  • It is unlikely that the measurement error for x (mass) is much smaller than the measurement error for y (number of eggs), so OLS regression is inappropriate for estimating the slope.
  • x and y are not in the same units, so we should probably not use MA regression.
    • If x and y are significantly correlated, then we can use SMA.
    • If there are no outliers, then we can use RMA.
    • To safely use any of the “major axis” methods, we need to confirm that the data are distributed approximately according to a bivariate normal distribution.

Let’s plot the data to see if we meet these assumptions.

d <- read.csv("Data/fisheggs.csv")

myPlot <- ggplot(data = d, aes(x = Mass, y = No_eggs)) + 
  geom_point() +
  geom_rug(sides = "tr", color = "grey") + # "tr" = rugs on top and right
  labs(x = "Fish mass", y = "Num. eggs") +
  theme_few()
ggMarginal(myPlot, type = "violin")

There appears to be a linear relationship. With so few data points it is difficult to tell whether the distributions are normal or not. I see no reason to strongly suspect that they are non-normal, although the possible presence of three “groups” is noticeable.3 There are no outliers, so RMA regression would be safe to use. Let’s conduct the regression and check for a significant correlation:

(fisheggs.lm <- lmodel2(No_eggs ~ Mass, data = d,
                        range.x = "relative", range.y = "relative", nperm = 999))
## 
## Model II regression
## 
## Call: lmodel2(formula = No_eggs ~ Mass, data = d, range.y = "relative",
## range.x = "relative", nperm = 999)
## 
## n = 11   r = 0.882318   r-square = 0.7784851 
## Parametric P-values:   2-tailed = 0.0003241737    1-tailed = 0.0001620869 
## Angle between the two OLS regression lines = 5.534075 degrees
## 
## Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
## A permutation test of r is equivalent to a permutation test of the OLS slope
## P-perm for SMA = NA because the SMA slope cannot be tested
## 
## Regression results
##   Method Intercept    Slope Angle (degrees) P-perm (1-tailed)
## 1    OLS 19.766816 1.869955        61.86337             0.001
## 2     MA  6.656633 2.301728        66.51716             0.001
## 3    SMA 12.193785 2.119366        64.74023                NA
## 4    RMA 13.179672 2.086897        64.39718             0.001
## 
## Confidence intervals
##   Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
## 1    OLS      -4.098376        43.63201   1.117797    2.622113
## 2     MA     -36.540814        27.83295   1.604304    3.724398
## 3    SMA     -14.576929        31.09957   1.496721    3.001037
## 4    RMA     -18.475744        35.25317   1.359925    3.129441
## 
## Eigenvalues: 494.634 17.49327 
## 
## H statistic used for computing C.I. of MA: 0.02161051

Technically, it is not possible to directly test the significance of an SMA regression (i.e., you cannot formally test whether its slope differs significantly from zero), but testing the significance of the correlation coefficient is essentially the same thing. In our case, the variables are significantly positively correlated (\(r = 0.88, p < 0.001\)), so we are okay to use SMA regression. SMA regression has the benefit of being better known and more easily interpreted than RMA regression, so I have shown only the SMA results below. (The RMA regression line is almost identical.)

plot(fisheggs.lm, "SMA", main = NULL, xlab = "Body mass", ylab = "Num. eggs")

Number of eggs laid is significantly positively correlated with body mass (\(r = 0.88, p < 0.001\)), and the SMA regression line of best fit is: \(num\_eggs = 12.19 + 2.12 \times mass\).
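
If you would rather extract the SMA coefficients programmatically than read them off the printed table, they are stored in the regression.results element of the fitted lmodel2 object:

(sma <- subset(fisheggs.lm$regression.results, Method == "SMA"))
sma$Intercept  # ~12.19
sma$Slope      # ~2.12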

ANCOVA

Analysis of covariance (ANCOVA) is an ANOVA with the addition of a continuous covariate.
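
In model form, the additive ANCOVA we eventually settle on below is

\[ y_{ij} = \mu + \alpha_i + \beta x_{ij} + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2), \]

where \(\alpha_i\) is the effect of group \(i\) (here, species), and \(\beta\) is a slope shared by all groups for the covariate \(x\) (here, ln body mass). The interaction model that we fit first additionally allows each group its own slope \(\beta_i\).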

Consider the dataset stored in NeanderthalBrainSize.csv (originally from Ruff et al., 1997). The purpose of this analysis was to determine whether Neanderthals and recent humans had significantly different brain sizes after accounting for differences in body size. First, let’s plot the data with a separate slope for each species.

d <- read.csv("Data/NeanderthalBrainSize.csv")
d$species <- factor(d$species,
                    levels = c("recent", "neanderthal"),
                    labels = c("Recent human", "Neanderthal"))

ggplot(data = d, aes(x = lnMass, y = lnBrain, colour = species, fill = species)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "ln(body mass)", y = "ln(brain mass)",
       colour = "Species", fill = "Species") +
  scale_colour_colorblind() + 
  scale_fill_colorblind() + 
  theme_few()

Neanderthals and recent humans appear to have slightly different slopes, suggesting a possible interaction; however, the confidence intervals for the slopes are very broad. Other possible points of concern are that Neanderthals may have slightly greater error variance and that the ranges of the covariate are not identical in both groups (although they overlap substantially).

Let’s formally test for an interaction. ANCOVA can be conducted using the lm() function.

neanderthal.lm <- lm(lnBrain ~ lnMass * species, data = d)
Anova(neanderthal.lm, type = 3)
## Anova Table (Type III tests)
## 
## Response: lnBrain
##                 Sum Sq Df  F value    Pr(>F)    
## (Intercept)    0.75102  1 169.5492 5.590e-15 ***
## lnMass         0.09110  1  20.5664 6.478e-05 ***
## species        0.00547  1   1.2358    0.2739    
## lnMass:species 0.00485  1   1.0938    0.3028    
## Residuals      0.15503 35                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction is not significant, so we can remove it and recalculate.

neanderthal.lm <- lm(lnBrain ~ lnMass + species, data = d)
(neanderthal.tab <- Anova(neanderthal.lm, type = 3))
## Anova Table (Type III tests)
## 
## Response: lnBrain
##              Sum Sq Df  F value    Pr(>F)    
## (Intercept) 0.83979  1 189.0958 6.719e-16 ***
## lnMass      0.13002  1  29.2764 4.262e-06 ***
## species     0.02755  1   6.2041   0.01749 *  
## Residuals   0.15988 36                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plot the residuals:

opar <- par(mfrow = c(2, 2))
plot(neanderthal.lm)
par(opar)

It looks pretty good to me. Now calculate the effect sizes.

omega_squared(neanderthal.tab)
## # Effect Size for ANOVA (Type III)
## 
## Parameter | Omega2 (partial) |       95% CI
## -------------------------------------------
## lnMass    |             0.42 | [0.22, 1.00]
## species   |             0.12 | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

Show the coefficient table.

S(neanderthal.lm)
## Call: lm(formula = lnBrain ~ lnMass + species, data = d)
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.25835    0.38239  13.751 6.72e-16 ***
## lnMass              0.49632    0.09173   5.411 4.26e-06 ***
## speciesNeanderthal -0.07028    0.02822  -2.491   0.0175 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard deviation: 0.06664 on 36 degrees of freedom
## Multiple R-squared: 0.4486
## F-statistic: 14.65 on 2 and 36 DF,  p-value: 2.219e-05 
##    AIC    BIC 
## -95.70 -89.05

Show confidence intervals.

confint(neanderthal.lm)
##                         2.5 %      97.5 %
## (Intercept)         4.4828232  6.03387599
## lnMass              0.3102840  0.68234820
## speciesNeanderthal -0.1275014 -0.01305537

Now we can get the estimated marginal means (EMMs) for each species:

emmeans(neanderthal.lm, ~species)
##  species      emmean     SE df lower.CL upper.CL
##  Recent human  7.342 0.0125 36    7.317    7.367
##  Neanderthal   7.272 0.0242 36    7.223    7.321
## 
## Confidence level used: 0.95

Recent humans have larger (log-transformed) brain mass than Neanderthals after adjusting for (log-transformed) body mass (\(F_{1,36}=6.2,p=0.017\)).
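
If you also want the adjusted difference between the two species as an explicit contrast with its own test, pairs() on the emmeans object provides it (a one-line sketch; output not shown):

pairs(emmeans(neanderthal.lm, ~species))  # adjusted Recent human - Neanderthal difference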

This week’s tasks

Write an R script that contains the following analyses.

Task 1

The wolves.csv dataset (originally from Liberg et al., 2005) contains information on the inbreeding coefficients of Scandinavian wolf cub litters and the number of wolf cubs in each litter that survived the winter. Based on what we know about genetics, we suspect that more-inbred litters (i.e., litters that result from the mating of closer relatives) will have fewer surviving cubs because of inbreeding depression. Plot the data and conduct an appropriate statistical test of this hypothesis.

Task 2

The larvalfish.csv dataset (originally from Hsieh et al., 2006) was collected to determine whether exploitation (i.e., fishing) has an effect on the year-to-year variability in population size after accounting for the fact that different species of fish reach maturity at different ages. Year-to-year variability in population size was recorded as the coefficient of variation (CV), which is the standard deviation divided by the mean. Plot the data and conduct the appropriate statistical test(s).

Task 3

The seaurchin.csv dataset (originally from Ebert & Russell, 1994) was collected to determine the relationship between total body mass and gonad mass of the slate-pencil sea urchin. How does gonad mass scale as a function of body mass? Measurements of these variables were taken from 17 different sea urchins. Plot the data and conduct the appropriate statistical test(s) to address your research question.


  1. Note: this analysis is specific to the case in which you have one y-value per x-value. In the case that you have multiple y-values per x-value, the analysis will be slightly different. We will cover this scenario when we get to mixed-effects models.↩︎

  2. Note: there is also another meaning for “RMA”. Reduced major axis regression is another name for standard major axis regression. Do not confuse reduced major axis regression with ranged major axis regression; they are not the same.↩︎

  3. To formally test for multivariate normality, we could use the mvn() function in the MVN package. I have done this and confirmed that these data do not deviate significantly from bivariate normal.↩︎
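
For the curious, the call in footnote 3 looks something like this (a sketch; it assumes the fisheggs data frame d, with columns Mass and No_eggs, is still loaded):

library(MVN)
mvn(d[, c("Mass", "No_eggs")])  # test of multivariate (here, bivariate) normality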