Multiple explanatory variables (Part 2)

M. Drew LaMar
April 17, 2017

“Models should be as simple as possible, but not more so.”

- Attributed to Einstein

Class announcements

  • Reading Assignment for Wednesday (no optional write-up for this):
    • Ruxton & Colegrave, Chapter 4: Different experimental designs
  • Exam #3 this Friday
    • W&S: Ch. 11-15 (excl. 14); R&C: Ch. 3
    • You may bring a formula sheet (2 sided is fine)!!

What is a model?

Definition: A model is an abstract (or concrete) representation of objects and their relationships or processes in the real world.

Definition: A mathematical model is a mathematical representation of the relationship between a response variable \( Y \) and one or more explanatory variables.

What is a General Linear Model?

Definition: A general linear model is a statistical linear model that generalizes regression to multiple explanatory variables, some possibly categorical.

For a general linear model, we have:

  • Explanatory: Multiple; Numerical or categorical
  • Response: Multiple; Numerical

Multiple response variables = multivariate. We will only explore 1 response variable.

Randomized block as a GLM

Linear model Other name Example study design
\( Y = \mu + X \) Linear regression Dose-response
\( Y = \mu + A \) One-way ANOVA Completely randomized
\( Y = \mu + A + b \) Two-way ANOVA,
no replication
Randomized block

Example 12.2: So macho it makes you sick?

Randomized block: Repeated measures

…Hasselquist et al. (1999) experimentally increased the testosterone levels of 13 male red-winged blackbirds by surgically implanting a small permeable tube filled with testosterone. They measured immunocompetence as the rate of antibody production in response to a nonpathogenic antigen in each bird's blood serum both before and after the implant. The antibody rates were measured optically, in units of log \( 10^{-3} \) optical density per minute (ln[mOD/min]).

Randomized block: Repeated measures

This is an example of a repeated measures design.

  • Previously, we saw this as a paired design.
  • Now, we can use individual as a blocking factor (random effect).
  • We have two explanatory variables:

    (1) time with levels before and after transplant, which is a categorical fixed effect, and
    (2) individual, which is a categorical random effect.

Thus, we are in the form \( Y = \mu + A + b \), or

\[ \mathrm{TESTOSTERONE} = \mathrm{CONSTANT} + \mathrm{TIME} + \mathrm{SUBJECT} \]

Randomized block: Repeated measures

Let's load and inspect the data.

'data.frame':   13 obs. of  5 variables:
 $ blackbird       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ beforeImplant   : int  105 50 136 90 122 132 131 119 145 130 ...
 $ afterImplant    : int  85 74 145 86 148 148 150 142 151 113 ...
 $ logBeforeImplant: num  4.65 3.91 4.91 4.5 4.8 4.88 4.88 4.78 4.98 4.87 ...
 $ logAfterImplant : num  4.44 4.3 4.98 4.45 5 5 5.01 4.96 5.02 4.73 ...

Discuss: Is this in wide or long format?

Answer: Wide!

Randomized block: Repeated measures

Let's go long!

'data.frame':   26 obs. of  3 variables:
 $ blackbird   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ time        : Factor w/ 2 levels "logBeforeImplant",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ testosterone: num  4.65 3.91 4.91 4.5 4.8 4.88 4.88 4.78 4.98 4.87 ...

Randomized block: Repeated measures

Remember, we can do a paired \( t \)-test

t.test(testosterone ~ time, paired=TRUE, data=blackbird_long)

    Paired t-test

data:  testosterone by time
t = -1.2714, df = 12, p-value = 0.2277
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.15238464  0.04007695
sample estimates:
mean of the differences 
            -0.05615385 

Randomized block: Repeated measures

Same result if we do mixed-model with subject as random effect

birdAltModel <- lme(testosterone ~ time, 
                    random = ~ 1| blackbird, 
                    data = blackbird_long)

Fixed effect is time and random effect is blackbird, our blocking variable.

By having the subject (blackbird) as a random effect, we are explaining the amount of variation in testosterone between individuals, leaving the remaining variation to be explained by the procedure (fixed effect) and error (residual).

Randomized block: Repeated measures

summary(birdAltModel)
Linear mixed-effects model fit by REML
 Data: blackbird_long 
      AIC      BIC   logLik
  4.71235 9.424565 1.643825

Random effects:
 Formula: ~1 | blackbird
        (Intercept)  Residual
StdDev:   0.2463566 0.1126033

Fixed effects: testosterone ~ time 
                       Value  Std.Error DF  t-value p-value
(Intercept)         4.733846 0.07512609 12 63.01201  0.0000
timelogAfterImplant 0.056154 0.04416664 12  1.27141  0.2277
 Correlation: 
                    (Intr)
timelogAfterImplant -0.294

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-2.0341692 -0.3878921  0.1435801  0.4695251  0.9996207 

Number of Observations: 26
Number of Groups: 13 

We get the exact same \( P \)-value as a paired \( t \)-test!

Randomized block: Repeated measures

\[ \begin{align} \bar{Y}_{\mathrm{Before}} & = 4.7338\\ \bar{Y}_{\mathrm{After}} & = 4.7338+0.0562 \end{align} \]

Randomized block design: 3 groups

Example 18.2: Zooplankton depredation

Randomized block design: 3 groups

Example 18.2: Zooplankton depredation

Svanbäck and Bolnick (2007) set up a field experiment in the shallows of a small lake on Vancouver Island (British Columbia) that allowed them to measure how fish abundance affects the abundance and diversity of prey species. They used a randomized block design to minimize noise caused by background variation in prey availability between locations in the lake. Three fish abundance treatments - Low, High, and Control - were set up at each of five locations in the lake. Diversity (Levin's D) of zooplankton prey in each treatment at the five locations was measured after 13 days.

Randomized block design: 3 groups

Load and display data.

  treatment diversity block
1   control       4.1     1
2       low       2.2     1
3      high       1.3     1
4   control       3.2     2
5       low       2.4     2
6      high       2.0     2

Randomized block design: 3 groups

Take a peek.

plot of chunk unnamed-chunk-8

Randomized block design: 3 groups

zoopAltModel <- lme(diversity ~ treatment, 
                    random = ~ 1| block, 
                    data = zooplankton)

Randomized block design: 3 groups

summary(zoopAltModel)
Linear mixed-effects model fit by REML
 Data: zooplankton 
       AIC      BIC    logLik
  34.23401 36.65855 -12.11701

Random effects:
 Formula: ~1 | block
        (Intercept)  Residual
StdDev:    0.353789 0.4577117

Fixed effects: diversity ~ treatment 
              Value Std.Error DF   t-value p-value
(Intercept)    3.02 0.2587148  8 11.673087  0.0000
treatmentlow  -1.02 0.2894823  8 -3.523531  0.0078
treatmenthigh -1.64 0.2894823  8 -5.665286  0.0005
 Correlation: 
              (Intr) trtmntl
treatmentlow  -0.559        
treatmenthigh -0.559  0.500 

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-1.2763231 -0.6798032 -0.1239909  0.3587149  1.7986166 

Number of Observations: 15
Number of Groups: 5 

Randomized block design: 3 groups

anova(zoopAltModel)
            numDF denDF   F-value p-value
(Intercept)     1     8 116.69518  <.0001
treatment       2     8  16.36595  0.0015

Discuss: Conclusion?

Answer: Statistically significant treatment effect!

Randomized block design: 3 groups

plot of chunk unnamed-chunk-12

\[ \begin{align} \bar{Y}_{\mathrm{control}} & = 3.02 \\ \bar{Y}_{\mathrm{low}} & = 2.00 \\ \bar{Y}_{\mathrm{high}} & = 1.38 \end{align} \]

Factorial designs using GLMs

Linear model Other name Example study design
\( Y = \mu + A + B + A*B \) Two-way, fixed-
effects ANOVA
Factorial experiment
\( Y = \mu + A + b + A*b \) Two-way, mixed-
effects ANOVA
Factorial experiment
\( Y = \mu + X + A \) Analysis of covariance (ANCOVA) Observational study

The explanatory variables are called factors, as they represent treatments of direct interest.

\( \mathrm{A} \) and \( \mathrm{B} \) are called main effects; they represent effects of each factor alone, when averaged over the categories of the other factor.

Factorial designs using GLMs

Linear model Other name Example study design
\( Y = \mu + A + B + A*B \) Two-way, fixed-
effects ANOVA
Factorial experiment
\( Y = \mu + A + b + A*b \) Two-way, mixed-
effects ANOVA
Factorial experiment
\( Y = \mu + X + A \) Analysis of covariance (ANCOVA) Observational study

The explanatory variables are called factors, as they represent treatments of direct interest.

\( \mathrm{A*B} \) is the interaction term.

Factorial designs using GLMs

Factorial designs using GLMs

Example 18.3: Interaction zone

Factorial designs using GLMs

Harley (2003) investigated how herbivores affect the abundance of plants living in the intertidal habitat of coastal Washington using field transplants of a red alga, Mazzaella parksii. The experiment also examined whether the effect of herbivores on the algae depended on where in the intertidal zone the plants were growing. Thirty-two study plots were established just above the low-tide mark, and another 32 plots were set at mid-height between the low- and high-tide marks. Using copper fencing, herbivores were excluded from a randomly chosen half of the plots at each height.

Factorial designs using GLMs

\[ \begin{align} \mathrm{ALGAE} = & \mathrm{CONSTANT} + \mathrm{HERBIVORY} + \mathrm{HEIGHT} +\\ & \mathrm{HERBIVORY*HEIGHT} \end{align} \]

Factorial designs using GLMs

\[ \begin{align} \mathrm{ALGAE} = & \mathrm{CONSTANT} + \mathrm{HERBIVORY} + \mathrm{HEIGHT} +\\ & \mathrm{HERBIVORY*HEIGHT} \end{align} \]

We need to examine the improvement in fit of the model to the data with and without each term (i.e. main effects and interaction effect).

Question: Does herbivory have an effect on mean algal cover?

Null Model (Type I): \[ \mathrm{ALGAE} = \mathrm{CONSTANT} \]

Alt Model: \[ \mathrm{ALGAE} = \mathrm{CONSTANT} + \mathrm{HERBIVORY} \]

Factorial designs using GLMs

\[ \begin{align} \mathrm{ALGAE} = & \mathrm{CONSTANT} + \mathrm{HERBIVORY} + \mathrm{HEIGHT} +\\ & \mathrm{HERBIVORY*HEIGHT} \end{align} \]

We need to examine the improvement in fit of the model to the data with and without each term (i.e. main effects and interaction effect).

Question: Does height have an effect on mean algal cover?

Null Model (Type I): \[ \mathrm{ALGAE} = \mathrm{CONSTANT} \]

Alt Model: \[ \mathrm{ALGAE} = \mathrm{CONSTANT} + \mathrm{HEIGHT} \]

Factorial designs using GLMs

\[ \begin{align} \mathrm{ALGAE} = & \mathrm{CONSTANT} + \mathrm{HERBIVORY} + \mathrm{HEIGHT} +\\ & \mathrm{HERBIVORY*HEIGHT} \end{align} \]

We need to examine the improvement in fit of the model to the data with and without each term (i.e. main effects and interaction effect).

Question: Does effect of herbivory on mean algal cover depend on height?

Null Model: \[ \mathrm{ALGAE} = \mathrm{CONSTANT} + \mathrm{HERBIVORY} + \mathrm{HEIGHT} \]

Alt Model: \[ \begin{align} \mathrm{ALGAE} = & \mathrm{CONSTANT} + \mathrm{HERBIVORY} + \mathrm{HEIGHT} +\\ & \mathrm{HERBIVORY*HEIGHT} \end{align} \]

Factorial designs using GLMs

All of these terms get tested at one time using the following notation in R.

algaeFullModel <- lm(sqrtArea ~ height * herbivores, 
                     data = algae)

Caution: height * herbivores refers to all terms, while height : herbivores refers to the interaction term, i.e. the following performs the same analysis:

algaeFullModel <- lm(sqrtArea ~ height + 
                       herbivores + 
                       height : herbivores, 
                     data = algae)

Factorial designs using GLMs

Let's load the data.

'data.frame':   64 obs. of  3 variables:
 $ height    : Factor w/ 2 levels "low","mid": 1 1 1 1 1 1 1 1 1 1 ...
 $ herbivores: Factor w/ 2 levels "minus","plus": 1 1 1 1 1 1 1 1 1 1 ...
 $ sqrtArea  : num  9.41 34.47 46.67 16.64 24.38 ...

Factorial designs using GLMs

anova(algaeFullModel)
Analysis of Variance Table

Response: sqrtArea
                  Df  Sum Sq Mean Sq F value   Pr(>F)   
height             1    89.0   88.97  0.3741 0.543096   
herbivores         1  1512.2 1512.18  6.3579 0.014360 * 
height:herbivores  1  2617.0 2616.96 11.0029 0.001549 **
Residuals         60 14270.5  237.84                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Factorial designs using GLMs

algaeFullModel$coefficients
             (Intercept)                heightmid           herbivoresplus 
                32.91450                -10.43090                -22.51075 
heightmid:herbivoresplus 
                25.57809 
Term Estimate
(Intercept) 32.9145029
heightmid -10.430905
herbivoresplus -22.5107478
heightmid:herbivoresplus 25.5780939

Factorial designs using GLMs

Term Estimate
(Intercept) 32.9145029
heightmid -10.430905
herbivoresplus -22.5107478
heightmid:herbivoresplus 25.5780939

plot of chunk unnamed-chunk-19

ANCOVA as a GLM

Linear model Other name Example study design
\( Y = \mu + A + B + A*B \) Two-way, fixed-
effects ANOVA
Factorial experiment
\( Y = \mu + A + b + A*b \) Two-way, mixed-
effects ANOVA
Factorial experiment
\( Y = \mu + X + A \) Analysis of covariance (ANCOVA) Observational study

\[ \begin{align} \mathrm{RESPONSE} = & \mathrm{CONSTANT} + \mathrm{COVARIATE} + \\ & \mathrm{TREATMENT} \end{align} \]

Used to investigate if linear regressions fit to data with two or more groups have the same slope.

For observational studies, the numerical explanatory is called a confounding variable.

ANCOVA as a GLM

Example 18.4: Mole-rat layabouts

ANCOVA as a GLM

“Mole rats are the only known mammals with distinct social castes. A single queen and a small number of males are the only reproducing individuals in a colony. Remaining individuals, called workers, gather food, defend the colony, care for the young, and maintain the burrows. Recently, it was discovered that there might be two worker castes in the Damaraland mole rat. "Frequent workers” do almost all of the work in the colony, whereas “infrequent workers” do little work except on rare occasions… Scantlebury et al. (2006) compared daily energy expenditures of wild mole rats during a dry season. Energy expenditure appears to vary with body mass in both groups, but infrequent workers are heavier than frequent workers.

ANCOVA as a GLM

Open circles: “frequent workers”
Solid circles: “infrequent workers”

Question: How different is mean energy expenditure between the two groups when adjusted for differences in body mass?

ANCOVA as a GLM

Step 0: Load data

moleRat <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter18/chap18e4MoleRatLayabouts.csv"))
moleRatSorted <- moleRat %>% arrange(lnMass)
str(moleRatSorted)
'data.frame':   35 obs. of  3 variables:
 $ caste   : Factor w/ 2 levels "lazy","worker": 2 2 2 2 2 2 2 2 2 2 ...
 $ lnMass  : num  3.85 3.95 3.99 3.99 4.11 ...
 $ lnEnergy: num  3.69 4.11 3.69 4.19 3.69 ...

ANCOVA as a GLM

Step 1: Test for interaction

moleRatNoInteractModel <- lm(lnEnergy ~ lnMass + caste, data = moleRatSorted)

moleRatFullModel <- lm(lnEnergy ~ lnMass * caste, data = moleRatSorted)

anova(moleRatNoInteractModel, moleRatFullModel)
Analysis of Variance Table

Model 1: lnEnergy ~ lnMass + caste
Model 2: lnEnergy ~ lnMass * caste
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     32 2.8145                           
2     31 2.7249  1  0.089557 1.0188 0.3206

Discuss: Conclusion?

ANCOVA as a GLM

Step 2: Test for differences in castes, assuming no interaction

anova(moleRatNoInteractModel)
Analysis of Variance Table

Response: lnEnergy
          Df  Sum Sq Mean Sq F value    Pr(>F)    
lnMass     1 1.31061 1.31061 14.9013 0.0005178 ***
caste      1 0.63747 0.63747  7.2478 0.0111984 *  
Residuals 32 2.81450 0.08795                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Discuss: Conclusion?

ANCOVA as a GLM

plot of chunk unnamed-chunk-23

ANCOVA as a GLM

Effect size: \( \ \hat{Y}_{\mathrm{frequent}}-\hat{Y}_{\mathrm{infrequent}} = 0.39334 \)

ANCOVA as a GLM

Estimation of effect size

moleRatNoInteractModel

Call:
lm(formula = lnEnergy ~ lnMass + caste, data = moleRatSorted)

Coefficients:
(Intercept)       lnMass  casteworker  
   -0.09687      0.89282      0.39334  
confint(moleRatNoInteractModel)
                  2.5 %    97.5 %
(Intercept) -2.01627751 1.8225407
lnMass       0.49961872 1.2860115
casteworker  0.09573442 0.6909503