Multiple explanatory variables (cont'd)

M. Drew LaMar
April 25, 2016

“Models should be as simple as possible, but not more so.”

- Attributed to Einstein

What is a model?

Definition: a mathematical model is a mathematical representation of the relationship between a response variable \( Y \) and one or more explanatory variables.

Definition: A general linear model is a statistical linear model that generalizes regression to multiple explanatory variables, some possibly categorical.

For a general linear model, we have:

  • Explanatory: Multiple; Numerical or categorical
  • Response: Multiple; Numerical

Multiple response variables = multivariate. We will only explore 1 response variable.

General linear models

Notationally, if we have two explanatory variables \( X_{1} \) and \( X_{2} \) and one response variable \( Y \), a general linear model is of the form

\[ Y = \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{12}X_{1}X_{2} + \epsilon, \]

where \( \epsilon \sim N(0,\sigma^{2}) \).

This can also be written without referring to slopes as

\[ \begin{align} \textrm{RESPONSE} = & \textrm{CONSTANT} + \textrm{VARIABLE1} + \\ & \textrm{VARIABLE2} + \\ & \textrm{VARIABLE2*VARIABLE2} \end{align} \]

General linear models: Types

Linear model Other name Example study design
\( Y = \mu + X \) Linear regression Dose-response
\( Y = \mu + A \) One-way ANOVA Completely randomized
\( Y = \mu + A + b \) Two-way ANOVA,
no replication
Randomized block

Key:

  • \( Y \): [Response] Numerical
  • \( X \): [Explanatory] Numerical explanatory
  • \( A,B \): [Explanatory] Categorical, fixed-effect
  • \( b \): [Explanatory] Categorical, random-effect
  • \( \mu \): Constant (mean or intercept)

General linear models: Types (cont'd)

Linear model Other name Example study design
\( Y = \mu + A + B + A*B \) Two-way, fixed-
effects ANOVA
Factorial experiment
\( Y = \mu + A + b + A*b \) Two-way, mixed-
effects ANOVA
Factorial experiment
\( Y = \mu + X + A \) Analysis of covariance (ANCOVA) Observational study
\( Y = \mu + X_{1} + X_{2} + X_{1}*X_{2} \) Multiple regression Dose-response

Linear regression as a GLM

Linear model Other name Example study design
\( Y = \mu + X \) Linear regression Dose-response
\( Y = \mu + A \) One-way ANOVA Completely randomized
\( Y = \mu + A + b \) Two-way ANOVA,
no replication
Randomized block

Example 17.3 (p. 551): Prairie Home Campion

Linear regression as a GLM

Human activity is reducing species numbers in many of the world's ecosystems. Does this decrease affect basic ecosystem properties? Or are different plant species largely substitutable, with lost species compensated by those species remaining? To find out, Tilman et al. (2006) seeded 161 plots, each measuring \( 9 \times 9 \) meters, at the Cedar Creek Reserve in Minnesota. They used a varying number of prairie plant species and measured plant biomass production over 10 subsequent years. Treatments of either 1, 2, 4, 8, or 16 plant species (randomly chosen from a set of 18 perennials) were randomly assigned to plots. After 10 years of measurement, the researchers measured the “stability” of plant biomass production in every plot as mean biomass divided by the standard deviation in biomass over the 10 years (the reciprocal of the coefficient of variation).

Linear regression as a GLM

\[ Y = \alpha + \beta X \]

\[ \mathrm{LOGSTABILITY} = \alpha + \beta(\mathrm{SPECIESNUMBER}) \]

Let's load and inspect the data.

prairie <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17e3PlantDiversityAndStability.csv")
str(prairie)
'data.frame':   161 obs. of  2 variables:
 $ nSpecies        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ biomassStability: num  7.47 6.74 6.61 6.4 5.67 5.26 4.8 4.4 4.4 4.26 ...

Linear regression as a GLM

Null model: \( \mathrm{LOGSTABILITY} = \alpha^{\prime} \)

Regression model: \( \mathrm{LOGSTABILITY} = \alpha + \beta(\mathrm{SPECIESNUMBER}) \)

Linear regression as a GLM

Fit models (perform regressions)

prairieNullModel <- lm(logStability ~ 1, data = prairie)

prairieRegression <- lm(logStability ~ nSpecies, data = prairie)

Linear regression as a GLM

Compare models: Regression model better fit?

anova(prairieNullModel, prairieRegression)
Analysis of Variance Table

Model 1: logStability ~ 1
Model 2: logStability ~ nSpecies
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    160 24.815                                  
2    159 19.298  1    5.5169 45.455 2.733e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: Same results as anova(prairieRegression)

Linear regression as a GLM

Linear regression as a GLM

Compare models: Regression model better fit?

anova(prairieRegression)
Analysis of Variance Table

Response: logStability
           Df  Sum Sq Mean Sq F value    Pr(>F)    
nSpecies    1  5.5169  5.5169  45.455 2.733e-10 ***
Residuals 159 19.2980  0.1214                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One-way, single factor ANOVA as a GLM

Linear model Other name Example study design
\( Y = \mu + X \) Linear regression Dose-response
\( Y = \mu + A \) One-way ANOVA Completely randomized
\( Y = \mu + A + b \) Two-way ANOVA,
no replication
Randomized block

Example 15.1 (p. 460): The knees who say night

One-way, single factor ANOVA as a GLM

Traveling to a different time zone can cause jet lag, but people adjust as the schedule of light to their eyes in the new time zone gradually resets their internal, circadian clock. This change in their internal clock is called a phase shift. Campbell and Murphy (1998) reported that the human circadian clock can also be reset by exposing the back of the knee to light, a finding met with skepticism by some, but hailed as a major discovery by others. A later experiment by Wright and Czeisler (2002) re-examined the phenomenon. The new experiment measured circadian rhythm by the daily cycle of melatonin production in 22 people randomly assigned to one of three light treatments: control, knees, and eyes.“

One-way, single factor ANOVA as a GLM

One-way, single factor ANOVA as a GLM

\[ \begin{align} Y & = \mu + X \ (\textrm{regression}) \\ Y & = \mu + A \ (\textrm{single factor ANOVA}) \end{align} \]

Notation Regression ANOVA
\( \mu \) Intercept Grand mean
X Numerical
A Categorical

One-way, single factor ANOVA as a GLM

\[ Y = \mu + A \]

\[ \mathrm{SHIFT} = \mathrm{CONSTANT} + \mathrm{TREATMENT} \]

Let's load and inspect the data.

circadian <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter15/chap15e1KneesWhoSayNight.csv")
str(circadian)
'data.frame':   22 obs. of  2 variables:
 $ treatment: Factor w/ 3 levels "control","eyes",..: 1 1 1 1 1 1 1 1 3 3 ...
 $ shift    : num  0.53 0.36 0.2 -0.37 -0.6 -0.64 -0.68 -1.27 0.73 0.31 ...

One-way, single factor ANOVA as a GLM

Null model: \( \mathrm{SHIFT} = \mathrm{CONSTANT} \)

Alternative model: \( \mathrm{SHIFT} = \mathrm{CONSTANT} + \mathrm{TREATMENT} \)

One-way, single factor ANOVA as a GLM

Fit models (perform ANOVAs)

circadianNullModel <- lm(shift ~ 1, data = circadian)

circadianAltModel <- lm(shift ~ treatment, data = circadian)

One-way, single factor ANOVA as a GLM

Compare models: Alternative model better fit?

anova(circadianNullModel, circadianAltModel)
Analysis of Variance Table

Model 1: shift ~ 1
Model 2: shift ~ treatment
  Res.Df     RSS Df Sum of Sq      F   Pr(>F)   
1     21 16.6398                                
2     19  9.4153  2    7.2245 7.2894 0.004472 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One-way, single factor ANOVA as a GLM

Definition: \( F \) measures the improvement in the fit of a general linear model to the data when the term describing a given explanatory variable is included in the model, compared with the null model in with the term is absent.

One-way, single factor ANOVA as a GLM

Compare models: Alternative model better fit?

anova(circadianAltModel)
Analysis of Variance Table

Response: shift
          Df Sum Sq Mean Sq F value   Pr(>F)   
treatment  2 7.2245  3.6122  7.2894 0.004472 **
Residuals 19 9.4153  0.4955                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Randomized block as a GLM

Linear model Other name Example study design
\( Y = \mu + X \) Linear regression Dose-response
\( Y = \mu + A \) One-way ANOVA Completely randomized
\( Y = \mu + A + b \) Two-way ANOVA,
no replication
Randomized block

Example 12.2: So macho it makes you sick?

Randomized block: Repeated measures

In many species, males are more likely to attract females if the males have high testosterone levels. Are males with high testosterone paying a cost for this extra mating success in other ways? One hypothesis is that males with high testosterone might be less able to fight off disease. To test this idea, Hasselquist et al. (1999) experimentally increased the testosterone levels of 13 male red-winged blackbirds by surgically implanting a small permeable tube filled with testosterone. They measured immunocompetence as the rate of antibody production in response to a nonpathogenic antigen in each bird's blood serum both before and after the implant. The antibody rates were measured optically, in units of log \( 10^{-3} \) optical density per minute (ln[mOD/min]).

Randomized block: Repeated measures

This is an example of a repeated measures design.

  • Previously, we saw this as a paired design.
  • Now, we can use individual as a blocking factor (random effect).
  • We have two explanatory variables:

    (1) time with levels before and after transplant, which is a categorical fixed effect, and
    (2) individual, which is a categorical random effect.

Thus, we are in the form \( Y = \mu + A + b \), or

\[ \mathrm{TESTOSTERONE} = \mathrm{CONSTANT} + \mathrm{TIME} + \mathrm{SUBJECT} \]

Randomized block: Repeated measures

Let's load and inspect the data.

'data.frame':   13 obs. of  5 variables:
 $ blackbird       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ beforeImplant   : int  105 50 136 90 122 132 131 119 145 130 ...
 $ afterImplant    : int  85 74 145 86 148 148 150 142 151 113 ...
 $ logBeforeImplant: num  4.65 3.91 4.91 4.5 4.8 4.88 4.88 4.78 4.98 4.87 ...
 $ logAfterImplant : num  4.44 4.3 4.98 4.45 5 5 5.01 4.96 5.02 4.73 ...

Discuss: Is this in wide or long format?

Answer: Wide!

Randomized block: Repeated measures

Let's go long!

blackbird_long <- blackbird %>% 
                  select(blackbird, starts_with("log")) %>%
                  gather(time, testosterone, starts_with("log"), factor_key=TRUE)

str(blackbird_long)
'data.frame':   26 obs. of  3 variables:
 $ blackbird   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ time        : Factor w/ 2 levels "logBeforeImplant",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ testosterone: num  4.65 3.91 4.91 4.5 4.8 4.88 4.88 4.78 4.98 4.87 ...

Randomized block: Repeated measures

Remember, we can do a paired \( t \)-test

t.test(testosterone ~ time, paired=TRUE, data=blackbird_long)

    Paired t-test

data:  testosterone by time
t = -1.2714, df = 12, p-value = 0.2277
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.15238464  0.04007695
sample estimates:
mean of the differences 
            -0.05615385 

Randomized block: Repeated measures

Same result if we do mixed-model with subject as random effect

library(nlme)

birdAltModel <- lme(testosterone ~ time, random = ~ 1| blackbird, data = blackbird_long)

Fixed effect is time and random effect is blackbird, our blocking variable.

Essentially, by having the subject (blackbird) as a random effect, we are explaining the amount of variation in testosterone between individuals, leaving the remaining variation to be explained by the procedure and error.

Randomized block: Repeated measures

summary(birdAltModel)
Linear mixed-effects model fit by REML
 Data: blackbird_long 
      AIC      BIC   logLik
  4.71235 9.424565 1.643825

Random effects:
 Formula: ~1 | blackbird
        (Intercept)  Residual
StdDev:   0.2463566 0.1126033

Fixed effects: testosterone ~ time 
                       Value  Std.Error DF  t-value p-value
(Intercept)         4.733846 0.07512609 12 63.01201  0.0000
timelogAfterImplant 0.056154 0.04416664 12  1.27141  0.2277
 Correlation: 
                    (Intr)
timelogAfterImplant -0.294

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-2.0341692 -0.3878921  0.1435801  0.4695251  0.9996207 

Number of Observations: 26
Number of Groups: 13 

We get the exact same \( P \)-value as a paired \( t \)-test!

Randomized block: Repeated measures

\[ \begin{align} \bar{Y}_{\mathrm{Before}} & = 4.7338\\ \bar{Y}_{\mathrm{Blackbird \ | \ Before}} & = 4.7338 + N(0,0.2464^2) \end{align} \]

Randomized block: Repeated measures

\[ \begin{align} \bar{Y}_{\mathrm{After}} & = 4.7338+0.0562\\ \bar{Y}_{\mathrm{Blackbird \ | \ After}} & = 4.7338+0.0562 + N(0,0.2464^2) \end{align} \]

Randomized block design: 3 groups

Example 18.2: Zooplankton depredation

Randomized block design: 3 groups

Example 18.2: Zooplankton depredation

Svanbäck and Bolnick (2007) set up a field experiment in the shallows of a small lake on Vancouver Island (British Columbia) that allowed them to measure how fish abundance affects the abundance and diversity of prey species. They used a randomized block design to minimize noise caused by background variation in prey availability between locations in the lake. Three fish abundance treatments - Low, High, and Control - were set up at each of five locations in the lake. Diversity (Levin's D) of zooplankton prey in each treatment at the five locations was measured after 13 days.

Randomized block design: 3 groups

Load and display data.

  treatment diversity block
1   control       4.1     1
2       low       2.2     1
3      high       1.3     1
4   control       3.2     2
5       low       2.4     2
6      high       2.0     2

Randomized block design: 3 groups

Take a peek.

plot of chunk unnamed-chunk-17

Randomized block design: 3 groups

zoopAltModel <- lme(diversity ~ treatment, random = ~ 1| block, data = zooplankton)

Randomized block design: 3 groups

summary(zoopAltModel)
Linear mixed-effects model fit by REML
 Data: zooplankton 
       AIC      BIC    logLik
  34.23401 36.65855 -12.11701

Random effects:
 Formula: ~1 | block
        (Intercept)  Residual
StdDev:    0.353789 0.4577117

Fixed effects: diversity ~ treatment 
              Value Std.Error DF   t-value p-value
(Intercept)    3.02 0.2587148  8 11.673087  0.0000
treatmentlow  -1.02 0.2894823  8 -3.523531  0.0078
treatmenthigh -1.64 0.2894823  8 -5.665286  0.0005
 Correlation: 
              (Intr) trtmntl
treatmentlow  -0.559        
treatmenthigh -0.559  0.500 

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-1.2763231 -0.6798032 -0.1239909  0.3587149  1.7986166 

Number of Observations: 15
Number of Groups: 5 

Randomized block design: 3 groups

plot of chunk unnamed-chunk-20

\[ \begin{align} \bar{Y}_{\mathrm{block \ | \ control}} & = 3.02 + N(0,0.354^2) \\ \bar{Y}_{\mathrm{block \ | \ low}} & = 2.00 + N(0,0.354^2) \\ \bar{Y}_{\mathrm{block \ | \ high}} & = 1.38 + N(0,0.354^2) \end{align} \]