M. Drew LaMar
April 14, 2017
“Statistics may be defined as 'a body of methods for making wise decisions in the face of uncertainty.'”
- W.A. Wallis
Definition: A
model is an abstract (or concrete) representation of objects and their relationships or processes in the real world.
Definition: A
mathematical model is a mathematical representation of the relationship between a response variable \( Y \) and one or more explanatory variables.
Definition: A
general linear model is a statistical linear model that generalizes regression to multiple explanatory variables, some possibly categorical.
For a general linear model, we have:
Multiple response variables = multivariate. We will only explore 1 response variable.
Notationally, if we have two explanatory variables \( X_{1} \) and \( X_{2} \) and one response variable \( Y \), a general linear model is of the form
\[ Y = \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{12}X_{1}X_{2} + \epsilon, \]
where \( \epsilon \sim N(0,\sigma^{2}) \).
This can also be written without referring to slopes as
\[ \begin{align} \textrm{RESPONSE} = & \textrm{CONSTANT} + \textrm{VARIABLE1} + \\ & \textrm{VARIABLE2} + \\ & \textrm{VARIABLE2*VARIABLE2} \end{align} \]
Linear model | Other name | Example study design |
---|---|---|
\( Y = \mu + X \) | Linear regression | Dose-response |
\( Y = \mu + A \) | One-way ANOVA | Completely randomized |
\( Y = \mu + A + b \) | Two-way ANOVA, no replication |
Randomized block |
Key:
Linear model | Other name | Example study design |
---|---|---|
\( Y = \mu + A + B + A*B \) | Two-way, fixed- effects ANOVA |
Factorial experiment |
\( Y = \mu + A + b + A*b \) | Two-way, mixed- effects ANOVA |
Factorial experiment |
\( Y = \mu + X + A \) | Analysis of covariance (ANCOVA) |
Observational study |
\( Y = \mu + X_{1} + X_{2} + X_{1}*X_{2} \) | Multiple regression | Dose-response |
Linear model | Other name | Example study design |
---|---|---|
\( Y = \mu + X \) | Linear regression | Dose-response |
\( Y = \mu + A \) | One-way ANOVA | Completely randomized |
\( Y = \mu + A + b \) | Two-way ANOVA, no replication |
Randomized block |
Example 17.3 (p. 551): Prairie Home Campion
Human activity is reducing species numbers in many of the world's ecosystems. Does this decrease affect basic ecosystem properties? Or are different plant species largely substitutable, with lost species compensated by those species remaining? To find out, Tilman et al. (2006) seeded 161 plots, each measuring \( 9 \times 9 \) meters, at the Cedar Creek Reserve in Minnesota. They used a varying number of prairie plant species and measured plant biomass production over 10 subsequent years. Treatments of either 1, 2, 4, 8, or 16 plant species (randomly chosen from a set of 18 perennials) were randomly assigned to plots. After 10 years of measurement, the researchers measured the “stability” of plant biomass production in every plot as mean biomass divided by the standard deviation in biomass over the 10 years (the reciprocal of the coefficient of variation).
\[ Y = \alpha + \beta X \]
\[ \mathrm{LOGSTABILITY} = \alpha + \beta(\mathrm{SPECIESNUMBER}) \]
Let's load and inspect the data.
prairie <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17e3PlantDiversityAndStability.csv")
str(prairie)
'data.frame': 161 obs. of 2 variables:
$ nSpecies : int 1 1 1 1 1 1 1 1 1 1 ...
$ biomassStability: num 7.47 6.74 6.61 6.4 5.67 5.26 4.8 4.4 4.4 4.26 ...
Null model: \( \mathrm{LOGSTABILITY} = \alpha^{\prime} \)
Regression model: \( \mathrm{LOGSTABILITY} = \alpha + \beta(\mathrm{SPECIESNUMBER}) \)
Fit models (perform regressions)
prairieNullModel <- lm(logStability ~ 1,
data = prairie)
prairieRegression <- lm(logStability ~ nSpecies,
data = prairie)
Compare models: Regression model better fit?
anova(prairieNullModel, prairieRegression)
Analysis of Variance Table
Model 1: logStability ~ 1
Model 2: logStability ~ nSpecies
Res.Df RSS Df Sum of Sq F Pr(>F)
1 160 24.815
2 159 19.298 1 5.5169 45.455 2.733e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: Same results as anova(prairieRegression)
Compare models: Regression model better fit?
anova(prairieRegression)
Analysis of Variance Table
Response: logStability
Df Sum Sq Mean Sq F value Pr(>F)
nSpecies 1 5.5169 5.5169 45.455 2.733e-10 ***
Residuals 159 19.2980 0.1214
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Linear model | Other name | Example study design |
---|---|---|
\( Y = \mu + X \) | Linear regression | Dose-response |
\( Y = \mu + A \) | One-way ANOVA | Completely randomized |
\( Y = \mu + A + b \) | Two-way ANOVA, no replication |
Randomized block |
Example 15.1 (p. 460): The knees who say night
Traveling to a different time zone can cause jet lag, but people adjust as the schedule of light to their eyes in the new time zone gradually resets their internal, circadian clock. This change in their internal clock is called a phase shift. Campbell and Murphy (1998) reported that the human circadian clock can also be reset by exposing the back of the knee to light, a finding met with skepticism by some, but hailed as a major discovery by others. A later experiment by Wright and Czeisler (2002) re-examined the phenomenon. The new experiment measured circadian rhythm by the daily cycle of melatonin production in 22 people randomly assigned to one of three light treatments: control, knees, and eyes.“
\[ \begin{align} Y & = \mu + X \ (\textrm{regression}) \\ Y & = \mu + A \ (\textrm{single factor ANOVA}) \end{align} \]
Notation | Regression | ANOVA |
---|---|---|
\( \mu \) | Intercept | Grand mean |
X | Numerical | – |
A | – | Categorical |
\[ Y = \mu + A \]
\[ \mathrm{SHIFT} = \mathrm{CONSTANT} + \mathrm{TREATMENT} \]
Let's load and inspect the data.
circadian <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter15/chap15e1KneesWhoSayNight.csv")
str(circadian)
'data.frame': 22 obs. of 2 variables:
$ treatment: Factor w/ 3 levels "control","eyes",..: 1 1 1 1 1 1 1 1 3 3 ...
$ shift : num 0.53 0.36 0.2 -0.37 -0.6 -0.64 -0.68 -1.27 0.73 0.31 ...
Null model: \( \mathrm{SHIFT} = \mathrm{CONSTANT} \)
Alternative model: \( \mathrm{SHIFT} = \mathrm{CONSTANT} + \mathrm{TREATMENT} \)
Fit models (perform ANOVAs)
circadianNullModel <- lm(shift ~ 1,
data = circadian)
circadianAltModel <- lm(shift ~ treatment,
data = circadian)
Compare models: Alternative model better fit?
anova(circadianNullModel, circadianAltModel)
Analysis of Variance Table
Model 1: shift ~ 1
Model 2: shift ~ treatment
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21 16.6398
2 19 9.4153 2 7.2245 7.2894 0.004472 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Definition: \( F \) measures the improvement in the fit of a general linear model to the data when the term describing a given explanatory variable is included in the model, compared with the null model in with the term is absent.
Compare models: Alternative model better fit?
anova(circadianAltModel)
Analysis of Variance Table
Response: shift
Df Sum Sq Mean Sq F value Pr(>F)
treatment 2 7.2245 3.6122 7.2894 0.004472 **
Residuals 19 9.4153 0.4955
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Linear model | Other name | Example study design |
---|---|---|
\( Y = \mu + X \) | Linear regression | Dose-response |
\( Y = \mu + A \) | One-way ANOVA | Completely randomized |
\( Y = \mu + A + b \) | Two-way ANOVA, no replication |
Randomized block |
Example 12.2: So macho it makes you sick?
In many species, males are more likely to attract females if the males have high testosterone levels. Are males with high testosterone paying a cost for this extra mating success in other ways? One hypothesis is that males with high testosterone might be less able to fight off disease. To test this idea, Hasselquist et al. (1999) experimentally increased the testosterone levels of 13 male red-winged blackbirds by surgically implanting a small permeable tube filled with testosterone. They measured immunocompetence as the rate of antibody production in response to a nonpathogenic antigen in each bird's blood serum both before and after the implant. The antibody rates were measured optically, in units of log \( 10^{-3} \) optical density per minute (ln[mOD/min]).
This is an example of a repeated measures design.
We have two explanatory variables:
(1) time with levels before and after transplant, which is a categorical fixed effect, and
(2) individual, which is a categorical random effect.
Thus, we are in the form \( Y = \mu + A + b \), or
\[ \mathrm{TESTOSTERONE} = \mathrm{CONSTANT} + \mathrm{TIME} + \mathrm{SUBJECT} \]
Let's load and inspect the data.
'data.frame': 13 obs. of 5 variables:
$ blackbird : int 1 2 3 4 5 6 7 8 9 10 ...
$ beforeImplant : int 105 50 136 90 122 132 131 119 145 130 ...
$ afterImplant : int 85 74 145 86 148 148 150 142 151 113 ...
$ logBeforeImplant: num 4.65 3.91 4.91 4.5 4.8 4.88 4.88 4.78 4.98 4.87 ...
$ logAfterImplant : num 4.44 4.3 4.98 4.45 5 5 5.01 4.96 5.02 4.73 ...
Discuss: Is this in wide or long format?
Answer: Wide!
Let's go long!
blackbird_long <- blackbird %>%
select(blackbird,
starts_with("log")) %>%
gather(time,
testosterone,
starts_with("log"),
factor_key=TRUE)
str(blackbird_long)
'data.frame': 26 obs. of 3 variables:
$ blackbird : int 1 2 3 4 5 6 7 8 9 10 ...
$ time : Factor w/ 2 levels "logBeforeImplant",..: 1 1 1 1 1 1 1 1 1 1 ...
$ testosterone: num 4.65 3.91 4.91 4.5 4.8 4.88 4.88 4.78 4.98 4.87 ...