BS2004 | Week 4

2024-01-29

Extending the LM further

Linear Models — a general framework

In a traditional framework, one would have

One-way ANOVA: one categorical EV with three or more levels (else t-test).
LM example: yield ~ fertiliser
One-way blocked ANOVA: one categorical EV and a blocking variable.
LM example: yield ~ block + fertiliser
Two-way ANOVA: two or more categorical EVs.
LM example: sq.blooms ~ water + shade
Regression: one continuous EV.
LM example: b.weight ~ av.food
Multiple regression: two or more continuous EVs.
LM example: b.weight ~ av.food + g.size + area

And what if one categorical EV and one continuous EV?

ANCOVA: Analysis of Covariance

LM brings these all under one framework.

Case Study

Effect of grazing on seed production

This example is from Crawley (2015), Statistics: an introduction using R. Wiley.

The Experiment

How does grazing affect seed production in scarlet gilia, Ipomopsis aggregata?

grazing is a two-level factor: grazed or ungrazed (protected by fences);
Plant size pre-grazing is recorded as the diameter of the top of its rootstock (root);
Weight of seeds produced by the plant is recorded (seed.wt) at the end of the growing season.

Expectations are:

Bigger plants produce more seeds than smaller plants.
Grazed plants produce fewer seeds than ungrazed.

Inspect the data

and try a naive ‘univariate’ analysis

Ipomopsis <- read.csv(file = "data/Ipomopsis.csv", stringsAsFactors=T)
names(Ipomopsis)  # check variable names

## [1] "root"    "seed.wt" "grazing"

levels(Ipomopsis$grazing)  # check names of `grazing` levels

## [1] "Grazed"   "Ungrazed"

# set "Ungrazed" as reference level, and set sum contrasts:
Ipomopsis <- within(Ipomopsis, {
  grazing <- factor(grazing, levels = c("Ungrazed", "Grazed"))
  contrasts(grazing) <- contr.sum
})

Expectation 1 — naive regression

Do bigger plants have more seeds?

m1 <- lm(seed.wt ~ root, Ipomopsis)  # LM for regression

Coefficients from model `m1`
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-41.28612	10.72259	-3.85039	0.00044
root	14.02235	1.46317	9.58354	0.00000

Yes, clearly it seems.

Expectation 2 — naive ANOVA

Does grazing reduce seed weight?

m2 <- lm(seed.wt ~ grazing, Ipomopsis)

Coefficients from model `m2`
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	59.4105	3.70219	16.04741	0.00000
grazing1	-8.5300	3.70219	-2.30404	0.02678

What!? — Grazed plants produced more seeds?

This seems very strange… can this be right?

What’s wrong here?

Our regression ignored the effect of grazing on seed.wt.
Our ANOVA ignored the effect of root size on seed.wt.

The line we’ve fitted does not fit the relationship within either group!
Plant size confounds the effect of grazing.

Time for ANCOVA

Analysis of Covariance

Yo, I’ll tell you what I want, what I really, really want So tell me what you want, what you really, really want I’ll tell you what I want, what I really, really want So tell me what you want, what you really, really want I wanna, (ha) I wanna, (ha) I wanna, (ha) I wanna, (ha) I wanna really, really, really wanna zigazig AH-N-COVA

Spice Girls, Wannabe (1996)

What we really want

We want to account for both EVs in a single model:

control for the effect of plant size (root) on seed.wt when comparing ungrazed and grazed
fit separate linear regressions to each of the two grazing groups, ungrazed and grazed.
Analysis of Covariance (ANCOVA) — root is the “covariate” or “nuissance variable”: similar to blocking!

LM framework has us covered again:

lm(seed.wt ~ root + grazing)

We get an ANCOVA simply by running anova() on the lm() fit.

Orthogonality check

m3 <- lm(seed.wt ~ root + grazing, Ipomopsis)
anova(m3)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
root	1	16795.0	16795.0	368.9	0
grazing	1	5264.4	5264.4	115.6	0
Residuals	37	1684.5	45.5

m3.b <- lm(seed.wt ~ grazing + root, Ipomopsis)
anova(m3.b)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
grazing	1	2910.4	2910.4	63.9	0
root	1	19148.9	19148.9	420.6	0
Residuals	37	1684.5	45.5

Order matters: data for ANCOVA are typically non-orthogonal.
Use adjusted SSQ.

Coefficients

Raw R output

summary(m3)

## 
## Call:
## lm(formula = seed.wt ~ root + grazing, data = Ipomopsis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1920  -2.8224   0.3223   3.9144  17.3290 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -109.778      8.318  -13.20 1.45e-15
## root          23.560      1.149   20.51  < 2e-16
## grazing1      18.052      1.679   10.75 6.11e-13
## 
## Residual standard error: 6.747 on 37 degrees of freedom
## Multiple R-squared:  0.9291, Adjusted R-squared:  0.9252 
## F-statistic: 242.3 on 2 and 37 DF,  p-value: < 2.2e-16

Remember that parameter estimates are independent of variable order in model, so m3 and m3.b give the same.

Interpretation?

Parameter estimates from `m3`.
Coef	In R output	Estimate	Std. Error	t value
\(\mu\)	(Intercept)	-110.0	8.32	-13.2
\(\alpha\)	root	23.6	1.15	20.5
\(\beta\)	grazing1	18.1	1.68	10.8

\[y = \mu + \alpha\times\mathtt{root} + \begin{bmatrix} \mathtt{grazing} & \mathrm{Deviation} \\ 1 & \beta \\ 2 & -\beta \end{bmatrix} + \epsilon\]

\(\mu\): mean of mean seed.wt in ungrazed and grazed plants (grand mean), adjusted for a root diameter of zero (!) — Whaaat? (Note the negative value: fictive…)
\(\alpha\): slope of the regression line(s): change in seed.wt per unit change in root diameter (in either group).
\(\beta\): Deviation of mean seed.wt in ungrazed \((+)\) and grazed \((-)\) plants from the grand mean, for any given root-stock diameter. Positive = higher in ungrazed.

Smarter Parametrisation

Can we get a more informative value for \(\mu\) (instead of mean for a fictive root stock diameter of zero…)? — Yes:

if we subtract mean(root) from root before fitting the model, \(\mu\) will be the mean seed.wt for an average-sized root! root.c now gives deviation from average root size.
This is known as centering a continuous EV on its mean (“means-centering”):

Ipomopsis <- within(Ipomopsis, {
  root.c <- root - mean(root)  # centered version of `root`
  })

m3.c <- update(m3, .~ root.c + grazing)  # Re-fit the model

New Estimates

Parameter estimates from `m3.c`.
Coef	In R output	Estimate	Std. Error	t value
\(\mu\)	(Intercept)	59.4	1.07	55.7
\(\alpha\)	root.c	23.6	1.15	20.5
\(\beta\)	grazing1	18.1	1.68	10.8

\[y = \mu + \alpha\times\mathtt{root.c} + \begin{bmatrix} \mathtt{grazing} & \mathrm{Deviation} \\ 1 & \beta \\ 2 & -\beta \end{bmatrix} + \epsilon\]

\(\mu\): Mean of the mean seed.wt in grazed and in ungrazed plants (grand mean), for plants that have average root diameter.
\(\alpha\) as before: change in seed.wt per unit change in root diameter.
\(\beta\) as before: deviation of mean seed.wt in ungrazed \((+)\) and grazed \((-)\) plants from the grand mean, for any given root-stock diameter.

\[y = \mu + \alpha\times\mathtt{root.c} + \beta\times\mathtt{grazing} + \epsilon\quad \mathrm{with}\quad \mathtt{grazing} = \begin{cases} +1&\mathrm{if\ ungrazed}\\ -1&\mathrm{if\ grazed}\end{cases}\]

Prediction

What is the predicted seet weight in ungrazed and grazed Ipomopsis with 6.5 cm roothead diameter?

recall that our model is after means-centering root, so we need to express rootstock diameter relative to the mean:

( my.x <- 6.5 - mean(Ipomopsis$root) )

## [1] -0.68115

ungrazed: \(\mu + \alpha\times-0.681 + \beta = 59.4 + 23.6\times-0.681 + 18.1 = 61.4\)
grazed: \(\mu + \alpha\times-0.681 - \beta = 59.4 + 23.6\times-0.681 - 18.1 = 25.2\)

Why are the lines parallel?

The parallel lines look reasonable enough here.

But why are they parallel?
Because we made them so!

This is how we have specified the model: there is only one regression slope in the model, so the lines cannot be but parallel.

Next week we will relax this assumption: interactions.

Summary

Univariate analyses can be grossly misleading.
If you know about / suspect any confounds, measure them and include them — if they are irrelevant, the LM will tell you!
- e.g., body size, age, …
“ANCOVA” is not just about controlling for continuous ‘confounds / nuissance’ variables: a categorical EV may also be masking the effect of a continuous EV of interest!
- e.g., consider including sex as EV.

The effect of (ignoring) multivariable effects are hard to predict! We will look at some scenarios in the next session.