2024-01-29

Extending the LM further

Linear Models — a general framework

In a traditional framework, one would have

  • One-way ANOVA: one categorical EV with three or more levels (else t-test).
    LM example: yield ~ fertiliser
  • One-way blocked ANOVA: one categorical EV and a blocking variable.
    LM example: yield ~ block + fertiliser
  • Two-way ANOVA: two or more categorical EVs.
    LM example: sq.blooms ~ water + shade
  • Regression: one continuous EV.
    LM example: b.weight ~ av.food
  • Multiple regression: two or more continuous EVs.
    LM example: b.weight ~ av.food + g.size + area

And what if one categorical EV and one continuous EV?

  • ANCOVA: Analysis of Covariance

LM brings these all under one framework.

Case Study

Effect of grazing on seed production

This example is from Crawley (2015), Statistics: an introduction using R. Wiley.

The Experiment

How does grazing affect seed production in scarlet gilia, Ipomopsis aggregata?

  • grazing is a two-level factor: grazed or ungrazed (protected by fences);
  • Plant size pre-grazing is recorded as the diameter of the top of its rootstock (root);
  • Weight of seeds produced by the plant is recorded (seed.wt) at the end of the growing season.

Expectations are:

  1. Bigger plants produce more seeds than smaller plants.
  2. Grazed plants produce fewer seeds than ungrazed.

Inspect the data

and try a naive ‘univariate’ analysis

Ipomopsis <- read.csv(file = "data/Ipomopsis.csv", stringsAsFactors=T)
names(Ipomopsis)  # check variable names
## [1] "root"    "seed.wt" "grazing"
levels(Ipomopsis$grazing)  # check names of `grazing` levels
## [1] "Grazed"   "Ungrazed"
# set "Ungrazed" as reference level, and set sum contrasts:
Ipomopsis <- within(Ipomopsis, {
  grazing <- factor(grazing, levels = c("Ungrazed", "Grazed"))
  contrasts(grazing) <- contr.sum
})

Expectation 1 — naive regression

Do bigger plants have more seeds?

m1 <- lm(seed.wt ~ root, Ipomopsis)  # LM for regression
Coefficients from model m1
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.28612 10.72259 -3.85039 0.00044
root 14.02235 1.46317 9.58354 0.00000


Yes, clearly it seems.

Expectation 2 — naive ANOVA

Does grazing reduce seed weight?

m2 <- lm(seed.wt ~ grazing, Ipomopsis)
Coefficients from model m2
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.4105 3.70219 16.04741 0.00000
grazing1 -8.5300 3.70219 -2.30404 0.02678


What!? — Grazed plants produced more seeds?

This seems very strange… can this be right?

What’s wrong here?

  • Our regression ignored the effect of grazing on seed.wt.
  • Our ANOVA ignored the effect of root size on seed.wt.

  • The line we’ve fitted does not fit the relationship within either group!
  • Plant size confounds the effect of grazing.

Time for ANCOVA

Analysis of Covariance

Yo, I’ll tell you what I want, what I really, really want So tell me what you want, what you really, really want I’ll tell you what I want, what I really, really want So tell me what you want, what you really, really want I wanna, (ha) I wanna, (ha) I wanna, (ha) I wanna, (ha) I wanna really, really, really wanna zigazig AH-N-COVA

Spice Girls, Wannabe (1996)

What we really want

We want to account for both EVs in a single model:

  • control for the effect of plant size (root) on seed.wt when comparing ungrazed and grazed
  • fit separate linear regressions to each of the two grazing groups, ungrazed and grazed.
  • Analysis of Covariance (ANCOVA) — root is the “covariate” or “nuissance variable”: similar to blocking!

LM framework has us covered again:

lm(seed.wt ~ root + grazing)

We get an ANCOVA simply by running anova() on the lm() fit.

Orthogonality check

m3 <- lm(seed.wt ~ root + grazing, Ipomopsis)
anova(m3)
Df Sum Sq Mean Sq F value Pr(>F)
root 1 16795.0 16795.0 368.9 0
grazing 1 5264.4 5264.4 115.6 0
Residuals 37 1684.5 45.5
m3.b <- lm(seed.wt ~ grazing + root, Ipomopsis)
anova(m3.b)
Df Sum Sq Mean Sq F value Pr(>F)
grazing 1 2910.4 2910.4 63.9 0
root 1 19148.9 19148.9 420.6 0
Residuals 37 1684.5 45.5

Order matters: data for ANCOVA are typically non-orthogonal.
Use adjusted SSQ.

Coefficients

Raw R output

summary(m3)
## 
## Call:
## lm(formula = seed.wt ~ root + grazing, data = Ipomopsis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1920  -2.8224   0.3223   3.9144  17.3290 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -109.778      8.318  -13.20 1.45e-15
## root          23.560      1.149   20.51  < 2e-16
## grazing1      18.052      1.679   10.75 6.11e-13
## 
## Residual standard error: 6.747 on 37 degrees of freedom
## Multiple R-squared:  0.9291, Adjusted R-squared:  0.9252 
## F-statistic: 242.3 on 2 and 37 DF,  p-value: < 2.2e-16

Remember that parameter estimates are independent of variable order in model, so m3 and m3.b give the same.

Interpretation?

Parameter estimates from m3.
Coef In R output Estimate Std. Error t value Pr(>|t|)
\(\mu\) (Intercept) -110.0 8.32 -13.2 0
\(\alpha\) root 23.6 1.15 20.5 0
\(\beta\) grazing1 18.1 1.68 10.8 0

\[y = \mu + \alpha\times\mathtt{root} + \begin{bmatrix} \mathtt{grazing} & \mathrm{Deviation} \\ 1 & \beta \\ 2 & -\beta \end{bmatrix} + \epsilon\]

  • \(\mu\): mean of mean seed.wt in ungrazed and grazed plants (grand mean), adjusted for a root diameter of zero (!) — Whaaat? (Note the negative value: fictive…)
  • \(\alpha\): slope of the regression line(s): change in seed.wt per unit change in root diameter (in either group).
  • \(\beta\): Deviation of mean seed.wt in ungrazed \((+)\) and grazed \((-)\) plants from the grand mean, for any given root-stock diameter. Positive = higher in ungrazed.

Smarter Parametrisation

Can we get a more informative value for \(\mu\) (instead of mean for a fictive root stock diameter of zero…)? — Yes:

  • if we subtract mean(root) from root before fitting the model, \(\mu\) will be the mean seed.wt for an average-sized root! root.c now gives deviation from average root size.

  • This is known as centering a continuous EV on its mean (“means-centering”):

Ipomopsis <- within(Ipomopsis, {
  root.c <- root - mean(root)  # centered version of `root`
  })

m3.c <- update(m3, .~ root.c + grazing)  # Re-fit the model

New Estimates

Parameter estimates from m3.c.
Coef In R output Estimate Std. Error t value Pr(>|t|)
\(\mu\) (Intercept) 59.4 1.07 55.7 0
\(\alpha\) root.c 23.6 1.15 20.5 0
\(\beta\) grazing1 18.1 1.68 10.8 0

\[y = \mu + \alpha\times\mathtt{root.c} + \begin{bmatrix} \mathtt{grazing} & \mathrm{Deviation} \\ 1 & \beta \\ 2 & -\beta \end{bmatrix} + \epsilon\]

  • \(\mu\): Mean of the mean seed.wt in grazed and in ungrazed plants (grand mean), for plants that have average root diameter.
  • \(\alpha\) as before: change in seed.wt per unit change in root diameter.
  • \(\beta\) as before: deviation of mean seed.wt in ungrazed \((+)\) and grazed \((-)\) plants from the grand mean, for any given root-stock diameter.

\[y = \mu + \alpha\times\mathtt{root.c} + \beta\times\mathtt{grazing} + \epsilon\quad \mathrm{with}\quad \mathtt{grazing} = \begin{cases} +1&\mathrm{if\ ungrazed}\\ -1&\mathrm{if\ grazed}\end{cases}\]

Prediction

What is the predicted seet weight in ungrazed and grazed Ipomopsis with 6.5 cm roothead diameter?

  • recall that our model is after means-centering root, so we need to express rootstock diameter relative to the mean:
( my.x <- 6.5 - mean(Ipomopsis$root) )
## [1] -0.68115
  • ungrazed: \(\mu + \alpha\times-0.681 + \beta = 59.4 + 23.6\times-0.681 + 18.1 = 61.4\)
  • grazed: \(\mu + \alpha\times-0.681 - \beta = 59.4 + 23.6\times-0.681 - 18.1 = 25.2\)

Why are the lines parallel?

The parallel lines look reasonable enough here.

  • But why are they parallel?
  • Because we made them so!

This is how we have specified the model: there is only one regression slope in the model, so the lines cannot be but parallel.

Next week we will relax this assumption: interactions.

Summary

  • Univariate analyses can be grossly misleading.
  • If you know about / suspect any confounds, measure them and include them — if they are irrelevant, the LM will tell you!
    • e.g., body size, age, …
  • “ANCOVA” is not just about controlling for continuous ‘confounds / nuissance’ variables: a categorical EV may also be masking the effect of a continuous EV of interest!
    • e.g., consider including sex as EV.

The effect of (ignoring) multivariable effects are hard to predict! We will look at some scenarios in the next session.